Problem Statement¶

Business Context¶

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use sales forecasts to estimate weekly, monthly, quarterly, and annual sales totals. A company needs an accurate sales forecast because it adds value across the organization and helps different verticals plan their future course of action.

Forecasting helps an organization plan its sales operations by region and gives the supply chain team valuable insight into the procurement of goods and materials. An accurate forecasting process has many benefits, including better decision-making about the future and reduced sales-pipeline and forecast risk. It also cuts the time spent planning territory coverage and establishes benchmarks that can be used to assess future trends.

Objective¶

SuperKart is a retail chain operating supermarkets and food marts across cities of various tiers, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.

To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.

Data Description¶

The data contains different attributes of the various products and stores. The detailed data dictionary is given below.

  • Product_Id - unique identifier of each product, each identifier having two letters at the beginning followed by a number.
  • Product_Weight - weight of each product
  • Product_Sugar_Content - sugar content of each product like low sugar, regular and no sugar
  • Product_Allocated_Area - ratio of the allocated display area of each product to the total display area of all the products in a store
  • Product_Type - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others
  • Product_MRP - maximum retail price of each product
  • Store_Id - unique identifier of each store
  • Store_Establishment_Year - year in which the store was established
  • Store_Size - size of the store depending on sq. feet like high, medium and low
  • Store_Location_City_Type - type of city in which the store is located, like Tier 1, Tier 2, and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than in Tier 2 and Tier 3 cities.
  • Store_Type - type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2 and Food Mart
  • Product_Store_Sales_Total - total revenue generated by the sale of that particular product in that particular store

🧠 Reckoning¶

ML Task Type: Regression - predicting sales

Target Variable: Product_Store_Sales_Total - the revenue generated by each product in each store

Problem Understanding:

  • Predict sales revenue at the product-store level (not time series forecasting)
  • Features include product characteristics (weight, sugar content, MRP, type, allocated area) and store characteristics (size, location tier, store type, establishment year)
  • Cross-sectional prediction problem: given a product and store combination, predict the sales total

Likely Success Metrics:

  • RMSE/MAE for model performance
  • MAPE (Mean Absolute Percentage Error) for business interpretability
  • R² for variance explained
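These metrics can be sketched with scikit-learn on a few illustrative numbers (the arrays below are made up for demonstration, not taken from the SuperKart data):

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    mean_absolute_percentage_error, r2_score
)

# Illustrative actual vs. predicted sales figures
y_true = np.array([2800.0, 4800.0, 4100.0, 2300.0])
y_pred = np.array([3000.0, 4500.0, 4200.0, 2100.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))      # penalizes large errors
mae = mean_absolute_error(y_true, y_pred)               # average absolute error
mape = mean_absolute_percentage_error(y_true, y_pred)   # returned as a fraction
r2 = r2_score(y_true, y_pred)                           # share of variance explained

print(f"RMSE={rmse:.1f}  MAE={mae:.1f}  MAPE={mape:.1%}  R²={r2:.3f}")
```

Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction (0.06 rather than 6%), so it needs multiplying by 100 when reported to the business.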

Key Insight: Despite the business context mentioning "forecasting for upcoming quarter", the data structure suggests this is a cross-sectional sales prediction problem rather than time series forecasting, since there are no temporal features provided.

🎯 Objective

Predict the Product_Store_Sales_Total (target variable), enabling:

  • Smarter inventory decisions
  • Regional strategy optimization
  • Integration into production systems (via backend + frontend)

⚠️ NOTE:

Despite the term "forecast" being used in the problem statement, the data and setup suggest:

🔍 It's a standard regression task, not a time-based forecast.

Here's why:

  • There's no time feature (like date, week, month).
  • Each row represents a product-store pair, not a time-ordered observation.
  • The target (Product_Store_Sales_Total) looks like an aggregate, not a point in time.

So:

✅ While "forecast" is used for business context, technically it's a supervised regression prediction task.

That distinction is important because it affects:

  • Feature choices
  • Feature engineering & Model types
  • Whether to use time-aware methods (which we don’t need here)

Hence, our problem context: predict sales for any product-store combination based on meaningful business characteristics.
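Because there is no temporal ordering, a plain random train/test split is valid. A minimal sketch on synthetic data (the column names mirror the dataset, but the values and the linear relationship are fabricated for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for the product-store table (illustrative values only)
X = pd.DataFrame({
    "Product_MRP": rng.uniform(31, 266, n),
    "Product_Weight": rng.uniform(4, 22, n),
})
y = 20 * X["Product_MRP"] + rng.normal(0, 100, n)

# No time ordering, so rows can be shuffled freely (no time-aware CV needed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
print(f"Held-out R²: {model.score(X_test, y_test):.3f}")
```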


Import Libraries¶

In [159]:
# Essential libraries for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tabulate as tb
import pprint as pp

from scipy import stats
from scipy.stats import skew, zscore

from datetime import datetime
In [160]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    OneHotEncoder, StandardScaler,
    OrdinalEncoder, PowerTransformer, FunctionTransformer
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

from scipy.stats import uniform, randint
In [161]:
#! pip install catboost
from catboost import CatBoostRegressor
from sklearn.linear_model import Ridge
In [162]:
import joblib
from pathlib import Path
In [163]:
import os
from io import StringIO
import sys
In [164]:
from huggingface_hub import login, HfApi
In [165]:
import requests
import json
In [166]:
import warnings
warnings.filterwarnings("ignore")
In [3]:
from IPython.display import HTML # for displaying images in notebook
In [167]:
# Configure display and plotting
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
plt.style.use('default')
sns.set_palette("husl")
#sns.set_palette('mako')

plt.rcParams["figure.figsize"] = (12, 6)

Experiment (Verification)¶

💡 The code below is just for verification (we faced some version-compatibility issues)

In [168]:
# print pandas version
print(pd.__version__)
2.3.0
In [169]:
import importlib.metadata
# There was a warning about this pandas dependency, so check and verify its version
print(importlib.metadata.version("bottleneck"))
1.5.0

Python Utils¶

In [170]:
def format_pct(val):
    """Format a val as percentage i.e max 2 decimal value & adding % at the end"""
    return f"{val:.2f}%"

def to_pct(value):
    """value is expected to be a normalized float value in [0, 1]"""
    return format_pct(value * 100)

def ratio_to_pct(a, b):
    """
    Convert a ratio to a percentage
    """
    return to_pct(a / b)
In [171]:
def table_it(data, title=None, tablefmt='simple', floatfmt='.2f', orientation='vertical', show_index=False):
    """
    Display single dictionary or pandas Series input as a formatted table

    Parameters:
    -----------
    data : dict or pandas.Series
        The data to display in table format
    title : str, optional
        Optional title for the table
    tablefmt : str, default 'simple'
        Table format style (simple, grid, fancy_grid, etc.)
    floatfmt : str, default '.2f'
        Float formatting for numerical values
    orientation : str, default 'vertical'
        'vertical' for rows, 'horizontal' for columns
    show_index : bool, default False
        Whether to show row numbers/index in the table

    Returns:
    --------
    None - prints formatted table
    """

    if title:
        print(f"\n{title}")
        print("-" * len(title))

    # Handle pandas Series
    if isinstance(data, pd.Series):
        if orientation == 'vertical':
            table_data = data.to_frame()
            print(tb.tabulate(table_data, headers='keys', tablefmt=tablefmt,
                            floatfmt=floatfmt, showindex=show_index))
        else:
            # Horizontal format (with .T)
            table_data = data.to_frame().T
            print(tb.tabulate(table_data, headers='keys', tablefmt=tablefmt,
                            floatfmt=floatfmt, showindex=show_index))

    # Handle dictionary
    elif isinstance(data, dict):
        if orientation == 'vertical':
            if show_index:
                # Add row numbers when show_index=True
                table_data = [[i, key, value] for i, (key, value) in enumerate(data.items(), 1)]
                headers = ['#', 'Key', 'Value']
            else:
                # Standard format without index
                table_data = [[key, value] for key, value in data.items()]
                headers = ['Key', 'Value']
            print(tb.tabulate(table_data, headers=headers, tablefmt=tablefmt, floatfmt=floatfmt))
        else:
            # Horizontal format for dictionary
            keys = list(data.keys())
            values = list(data.values())
            table_data = [values]
            if show_index:
                # For horizontal, we can add index as first row
                index_row = list(range(1, len(keys) + 1))
                table_data.insert(0, index_row)
                headers = ['#'] + keys
            else:
                headers = keys
            print(tb.tabulate(table_data, headers=headers, tablefmt=tablefmt, floatfmt=floatfmt))

    # Handle other iterables (lists, etc.)
    elif hasattr(data, '__iter__') and not isinstance(data, str):
        if show_index:
            # Index is already built-in for this case
            table_data = [[i, value] for i, value in enumerate(data, 1)]
            headers = ['#', 'Value']
        else:
            # Use enumerate starting from 1, but don't show explicit index column
            table_data = [[value] for value in data]
            headers = ['Value']
        print(tb.tabulate(table_data, headers=headers, tablefmt=tablefmt, floatfmt=floatfmt))

    else:
        print(f"Unsupported data type: {type(data)}")
        print("Supported types: pandas.Series, dict, or iterable")
In [172]:
def tb_describe(df_col):
    """
    Helper function to display descriptive statistics in a nicely formatted table

    Parameters:
    df_col : pandas Series or DataFrame column
        The column to generate descriptive statistics for

    Returns:
    None - prints formatted table
    """
    # Avoid shadowing the module-level `scipy.stats` import
    desc = df_col.describe().to_frame().T
    print(tb.tabulate(desc, headers='keys', tablefmt='simple', floatfmt='.2f'))
In [173]:
def classify_column(col: pd.Series) -> str:
    """
    Classify a column into one of the following categories:
    - ID or Unique Identifier
    - Categorical
    - Ordinal or Discrete
    - Continuous Numerical
    - Binary
    - Other / Needs Manual Check
    """
    if col.nunique() == len(col):
        return "ID or Unique Identifier"
    elif col.dtype == 'object' or col.dtype.name == 'category':
        return "Categorical"
    elif col.dtype in ['int64', 'float64']:
        if col.nunique() < 10:
            return "Ordinal or Discrete (Numerical)"
        else:
            return "Continuous Numerical"
    elif pd.api.types.is_bool_dtype(col):
        return "Binary"
    else:
        return "Other / Needs Manual Check"

Load the Data¶

In [174]:
df = pd.read_csv('SuperKart.csv')
# backup of original df
df_original = df.copy()
In [175]:
df.head()
Out[175]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36

Data Overview¶

In [176]:
def overview_data(df):
    """
    Comprehensive data overview function

    Args:
        df: pandas DataFrame

    Returns:
        dict: Overview statistics for further use
    """

    print("🔍 COMPREHENSIVE DATA OVERVIEW")
    print("=" * 50)

    print(f"💾 Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print()

    # 1. Basic Shape & Structure
    r, c = df.shape
    print(f"📊 Dataset Shape: {r:,} rows × {c} columns")
    print(f"🔄 Duplicate Rows: {df.duplicated().sum()}")

    # 2. Missing Values Analysis
    print("\n❓ MISSING VALUES ANALYSIS")
    print("-" * 30)
    missing_counts = df.isnull().sum()
    missing_data = pd.DataFrame({
        'Column': df.columns,
        'Missing_Count': missing_counts.values,
        'Missing_Percentage': [ratio_to_pct(c, len(df)) for c in missing_counts]
    })
    missing_data = missing_data[missing_data['Missing_Count'] > 0]

    if len(missing_data) == 0:
        print("✅ No missing values found!")
    else:
        print(missing_data.to_string(index=False))

    # 3. Data Types Summary
    print("\n📋 DATA TYPES SUMMARY")
    print("-" * 30)
    dtype_summary = df.dtypes.value_counts()
    for dtype, count in dtype_summary.items():
        print(f"{dtype}: {count} columns")

    # 4. Identify Numerical vs Categorical Columns
    numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

    print(f"\n🔢 NUMERICAL COLUMNS ({len(numerical_cols)}):")
    print(f"   {numerical_cols}")

    print(f"\n🏷️ CATEGORICAL COLUMNS ({len(categorical_cols)}):")
    print(f"   {categorical_cols}")

    # 5. Numerical Columns Statistics
    if numerical_cols:
        print("\n📈 NUMERICAL COLUMNS STATISTICS")
        print("-" * 40)
        numerical_stats = df[numerical_cols].describe().T
        print(numerical_stats.round(2))

    # 6. Categorical Columns Statistics
    if categorical_cols:
        print("\n📊 CATEGORICAL COLUMNS STATISTICS")
        print("-" * 40)
        for col in categorical_cols:
            unique_count = df[col].nunique()
            most_frequent = df[col].mode()[0]
            most_frequent_count = df[col].value_counts().iloc[0]

            print(f"{col}:")
            print(f"  - Unique Values: {unique_count}")
            print(f"  - Most Frequent: '{most_frequent}' ({most_frequent_count:,} times)")
            print(f"  - Sample Values: {list(df[col].unique()[:5])}")
            print()

    # 7. Target Variable Quick Stats (assuming last column or specify)
    target_col = df.columns[-1]  # Assuming target is last column
    if df[target_col].dtype in ['int64', 'float64']:
        print(f"🎯 TARGET VARIABLE: {target_col}")
        print("-" * 30)
        print(f"Min: {df[target_col].min():.2f}")
        print(f"Max: {df[target_col].max():.2f}")
        print(f"Mean: {df[target_col].mean():.2f}")
        print(f"Median: {df[target_col].median():.2f}")
        print(f"Std Dev: {df[target_col].std():.2f}")

    # Return summary for further use
    overview_stats = {
        'shape': df.shape,
        'missing_data': missing_data,
        'numerical_cols': numerical_cols,
        'categorical_cols': categorical_cols,
        'memory_usage_mb': df.memory_usage(deep=True).sum() / 1024**2,
        'duplicates': df.duplicated().sum()
    }

    return overview_stats
In [177]:
# Usage
overview = overview_data(df)
🔍 COMPREHENSIVE DATA OVERVIEW
==================================================
💾 Memory Usage: 4.16 MB

📊 Dataset Shape: 8,763 rows × 12 columns
🔄 Duplicate Rows: 0

❓ MISSING VALUES ANALYSIS
------------------------------
✅ No missing values found!

📋 DATA TYPES SUMMARY
------------------------------
object: 7 columns
float64: 4 columns
int64: 1 columns

🔢 NUMERICAL COLUMNS (5):
   ['Product_Weight', 'Product_Allocated_Area', 'Product_MRP', 'Store_Establishment_Year', 'Product_Store_Sales_Total']

🏷️ CATEGORICAL COLUMNS (7):
   ['Product_Id', 'Product_Sugar_Content', 'Product_Type', 'Store_Id', 'Store_Size', 'Store_Location_City_Type', 'Store_Type']

📈 NUMERICAL COLUMNS STATISTICS
----------------------------------------
                            count     mean      std     min      25%      50%  \
Product_Weight             8763.0    12.65     2.22     4.0    11.15    12.66   
Product_Allocated_Area     8763.0     0.07     0.05     0.0     0.03     0.06   
Product_MRP                8763.0   147.03    30.69    31.0   126.16   146.74   
Store_Establishment_Year   8763.0  2002.03     8.39  1987.0  1998.00  2009.00   
Product_Store_Sales_Total  8763.0  3464.00  1065.63    33.0  2761.72  3452.34   

                               75%     max  
Product_Weight               14.18    22.0  
Product_Allocated_Area        0.10     0.3  
Product_MRP                 167.58   266.0  
Store_Establishment_Year   2009.00  2009.0  
Product_Store_Sales_Total  4145.16  8000.0  

📊 CATEGORICAL COLUMNS STATISTICS
----------------------------------------
Product_Id:
  - Unique Values: 8763
  - Most Frequent: 'DR1005' (1 times)
  - Sample Values: ['FD6114', 'FD7839', 'FD5075', 'FD8233', 'NC1180']

Product_Sugar_Content:
  - Unique Values: 4
  - Most Frequent: 'Low Sugar' (4,885 times)
  - Sample Values: ['Low Sugar', 'Regular', 'No Sugar', 'reg']

Product_Type:
  - Unique Values: 16
  - Most Frequent: 'Fruits and Vegetables' (1,249 times)
  - Sample Values: ['Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Health and Hygiene']

Store_Id:
  - Unique Values: 4
  - Most Frequent: 'OUT004' (4,676 times)
  - Sample Values: ['OUT004', 'OUT003', 'OUT001', 'OUT002']

Store_Size:
  - Unique Values: 3
  - Most Frequent: 'Medium' (6,025 times)
  - Sample Values: ['Medium', 'High', 'Small']

Store_Location_City_Type:
  - Unique Values: 3
  - Most Frequent: 'Tier 2' (6,262 times)
  - Sample Values: ['Tier 2', 'Tier 1', 'Tier 3']

Store_Type:
  - Unique Values: 4
  - Most Frequent: 'Supermarket Type2' (4,676 times)
  - Sample Values: ['Supermarket Type2', 'Departmental Store', 'Supermarket Type1', 'Food Mart']

🎯 TARGET VARIABLE: Product_Store_Sales_Total
------------------------------
Min: 33.00
Max: 8000.00
Mean: 3464.00
Median: 3452.34
Std Dev: 1065.63
In [178]:
print("Columns Summary:")

summary = pd.DataFrame({
    "Column": df.columns,
    "Dtype": df.dtypes.values,
    "Missing": df.isnull().sum().values,
    "Unique": df.nunique().values
})

# Columns Summary
summary
Columns Summary:
Out[178]:
Column Dtype Missing Unique
0 Product_Id object 0 8763
1 Product_Weight float64 0 1113
2 Product_Sugar_Content object 0 4
3 Product_Allocated_Area float64 0 228
4 Product_Type object 0 16
5 Product_MRP float64 0 6100
6 Store_Id object 0 4
7 Store_Establishment_Year int64 0 4
8 Store_Size object 0 3
9 Store_Location_City_Type object 0 3
10 Store_Type object 0 4
11 Product_Store_Sales_Total float64 0 8668

🔍 Observation

  • None of the columns has missing values
  • Product_Id and Store_Id are nominal identifiers → can be ignored for modeling
  • Dataset has mixed types → requires encoding + scaling

    Store_Establishment_Year is numeric but can be treated as categorical; alternatively, a derived feature such as store age represents it more naturally as a number
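The derived-feature idea can be sketched as follows; the reference year (2010) is an assumption for illustration, and in practice the snapshot year of the data would be used:

```python
import pandas as pd

# The four establishment years observed in the data
stores = pd.DataFrame({"Store_Establishment_Year": [1987, 1998, 1999, 2009]})

# Hypothetical snapshot year; any fixed reference year gives a valid age feature
REFERENCE_YEAR = 2010
stores["Store_Age"] = REFERENCE_YEAR - stores["Store_Establishment_Year"]
print(stores["Store_Age"].tolist())
```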

In [179]:
# Target variable stats summary
print("Target variable distribution:")
df['Product_Store_Sales_Total'].describe()
Target variable distribution:
Out[179]:
count    8763.000000
mean     3464.003640
std      1065.630494
min        33.000000
25%      2761.715000
50%      3452.340000
75%      4145.165000
max      8000.000000
Name: Product_Store_Sales_Total, dtype: float64

🎯 TARGET VARIABLE INSIGHTS:

  1. Sales range: $33 to $8,000 shows diverse performance across stores
  2. Near-normal distribution: mean ≈ median indicates balanced, model-friendly data
  3. Moderate spread: a coefficient of variation of about 31% suggests consistent but varied performance
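These insights follow directly from the summary statistics; a quick check using the numbers printed above:

```python
# Values from df['Product_Store_Sales_Total'].describe() above
mean, median, std = 3464.00, 3452.34, 1065.63

cv = std / mean * 100                      # coefficient of variation, in %
gap = abs(mean - median) / mean * 100      # mean-median gap, in %

print(f"CV: {cv:.1f}%  mean-median gap: {gap:.2f}%")
```

A mean-median gap well under 1% supports the "approximately symmetric" reading.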
In [180]:
# ===========================
# 2. FEATURE CATEGORIZATION
# ===========================

# Define feature categories for systematic analysis
NUMERICAL_FEATURES = [
    'Product_Weight',
    'Product_Allocated_Area',
    'Product_MRP',
    'Store_Establishment_Year'
]

CATEGORICAL_FEATURES = [
    'Product_Sugar_Content',
    'Product_Type',
    'Store_Size',
    'Store_Location_City_Type',
    'Store_Type'
]

TARGET_VARIABLE = 'Product_Store_Sales_Total'

print("🏷️ FEATURE CATEGORIZATION")
print("-" * 40)
print(f"Numerical Features ({len(NUMERICAL_FEATURES)}): {NUMERICAL_FEATURES}")
print(f"Categorical Features ({len(CATEGORICAL_FEATURES)}): {CATEGORICAL_FEATURES}")
print(f"Target Variable: {TARGET_VARIABLE}")
🏷️ FEATURE CATEGORIZATION
----------------------------------------
Numerical Features (4): ['Product_Weight', 'Product_Allocated_Area', 'Product_MRP', 'Store_Establishment_Year']
Categorical Features (5): ['Product_Sugar_Content', 'Product_Type', 'Store_Size', 'Store_Location_City_Type', 'Store_Type']
Target Variable: Product_Store_Sales_Total
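The split above maps naturally onto a `ColumnTransformer`. A sketch of one possible layout; the specific encoders and scalers here are illustrative choices, not necessarily the notebook's final pipeline:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERICAL_FEATURES = ['Product_Weight', 'Product_Allocated_Area',
                      'Product_MRP', 'Store_Establishment_Year']
CATEGORICAL_FEATURES = ['Product_Sugar_Content', 'Product_Type', 'Store_Size',
                        'Store_Location_City_Type', 'Store_Type']

# Scale numeric columns and one-hot encode categoricals; unknown categories
# seen at inference time are ignored rather than raising an error
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), NUMERICAL_FEATURES),
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_FEATURES),
])
```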

EDA¶

Univariate Analysis¶

Helper (Utils)¶

In [181]:
class UnivariateAnalyzer:
    """
    Comprehensive univariate analysis for EDA and preprocessing guidance

    Methods:
    - analyze_numerical_column: Complete numerical feature analysis
    - analyze_categorical_column: Complete categorical feature analysis
    """

    def __init__(self, figsize=(12, 6)):
        """
        Initialize analyzer with plotting configuration

        Args:
            figsize: Figure size for plots (default: (12, 6))
        """
        self.figsize = figsize
        # plt.style.use('default')

    def analyze_numerical_column(self, df, column_name):
        """
        Comprehensive analysis for a single numerical column

        Args:
            df: pandas DataFrame
            column_name: string, name of the numerical column

        Returns:
            dict: Analysis results for preprocessing guidance
        """

        print(f"📊 NUMERICAL ANALYSIS: {column_name}")
        print("=" * 50)

        col_data = df[column_name].dropna()

        # 1. Basic Statistics
        print("📈 DESCRIPTIVE STATISTICS")
        print("-" * 30)
        desc_stats = col_data.describe()
        print(desc_stats.round(3))

        # 2. Distribution Analysis
        print("\n📋 DISTRIBUTION ANALYSIS")
        print("-" * 30)

        # Skewness & Kurtosis
        skewness = stats.skew(col_data)
        kurtosis = stats.kurtosis(col_data)

        print(f"Skewness: {skewness:.3f}")
        if abs(skewness) < 0.5:
            skew_interpretation = "Approximately Normal"
        elif abs(skewness) < 1:
            skew_interpretation = "Moderately Skewed"
        else:
            skew_interpretation = "Highly Skewed"
        print(f"Interpretation: {skew_interpretation}")

        print(f"Kurtosis: {kurtosis:.3f}")
        if abs(kurtosis) < 3:
            kurt_interpretation = "Normal-like tails"
        else:
            kurt_interpretation = "Heavy tails"
        print(f"Interpretation: {kurt_interpretation}")

        # 3. Outlier Detection
        print("\n🔍 OUTLIER ANALYSIS")
        print("-" * 30)

        Q1 = col_data.quantile(0.25)
        Q3 = col_data.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        outliers = col_data[(col_data < lower_bound) | (col_data > upper_bound)]
        outlier_percentage = (len(outliers) / len(col_data)) * 100

        print("IQR Method:")
        print(f"  Lower Bound: {lower_bound:.3f}")
        print(f"  Upper Bound: {upper_bound:.3f}")
        print(f"  Outliers: {len(outliers)} ({outlier_percentage:.1f}%)")

        # 4. Missing Values
        missing_count = df[column_name].isnull().sum()
        missing_percentage = (missing_count / len(df)) * 100
        print("\n❓ MISSING VALUES")
        print("-" * 30)
        print(f"Count: {missing_count} ({missing_percentage:.1f}%)")

        # 5. Preprocessing Recommendations
        print("\n🔧 PREPROCESSING RECOMMENDATIONS")
        print("-" * 30)

        recommendations = []

        # Scaling recommendations
        if col_data.std() > 1000 or col_data.max() - col_data.min() > 1000:
            recommendations.append("Consider StandardScaler or MinMaxScaler (large scale)")

        # Transformation recommendations
        if skewness > 1:
            recommendations.append("Consider Log transformation (right skewed)")
        elif skewness < -1:
            recommendations.append("Consider Square transformation (left skewed)")

        # Outlier handling
        if outlier_percentage > 5:
            recommendations.append("Consider RobustScaler (many outliers)")
        elif outlier_percentage > 1:
            recommendations.append("Consider outlier treatment (moderate outliers)")

        # Missing value handling
        if missing_percentage > 0:
            if missing_percentage < 5:
                recommendations.append("Consider mean/median imputation")
            else:
                recommendations.append("Investigate missing pattern, consider advanced imputation")

        if not recommendations:
            recommendations.append("Data appears clean, minimal preprocessing needed")

        for i, rec in enumerate(recommendations, 1):
            print(f"{i}. {rec}")

        # 6. Visualization
        fig, axes = plt.subplots(1, 2, figsize=self.figsize)
        fig.suptitle(f'{column_name} - Distribution Analysis', fontsize=14, fontweight='bold')

        # Histogram
        axes[0].hist(col_data, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
        axes[0].axvline(col_data.mean(), color='red', linestyle='--', label=f'Mean: {col_data.mean():.2f}')
        axes[0].axvline(col_data.median(), color='green', linestyle='--', label=f'Median: {col_data.median():.2f}')
        axes[0].set_title('Distribution')
        axes[0].set_xlabel(column_name)
        axes[0].set_ylabel('Frequency')
        axes[0].legend()

        # Boxplot
        axes[1].boxplot(col_data)
        axes[1].set_title('Boxplot (Outlier Detection)')
        axes[1].set_ylabel(column_name)

        plt.tight_layout()
        plt.show()

        # Return analysis results
        analysis_results = {
            'skewness': skewness,
            'kurtosis': kurtosis,
            'outlier_percentage': outlier_percentage,
            'missing_percentage': missing_percentage,
            'recommendations': recommendations,
            'statistics': desc_stats.to_dict()
        }

        return analysis_results

    def analyze_categorical_column(self, df, column_name):
        """
        Comprehensive analysis for a single categorical column

        Args:
            df: pandas DataFrame
            column_name: string, name of the categorical column

        Returns:
            dict: Analysis results for preprocessing guidance
        """

        print(f"📊 CATEGORICAL ANALYSIS: {column_name}")
        print("=" * 50)

        col_data = df[column_name]

        # 1. Basic Statistics
        print("📈 DESCRIPTIVE STATISTICS")
        print("-" * 30)

        unique_count = col_data.nunique()
        total_count = len(col_data)

        print(f"Total Values: {total_count:,}")
        print(f"Unique Values: {unique_count}")
        print(f"Cardinality: {'High' if unique_count > 10 else 'Medium' if unique_count > 2 else 'Low'}")

        # 2. Value Distribution
        print("\n📋 VALUE DISTRIBUTION")
        print("-" * 30)

        value_counts = col_data.value_counts()
        value_percentages = col_data.value_counts(normalize=True) * 100

        print("Top 10 Values:")
        for i, (value, count) in enumerate(value_counts.head(10).items(), 1):
            percentage = value_percentages[value]
            print(f"{i:2d}. {value}: {count:,} ({percentage:.1f}%)")

        # 3. Data Quality Check
        print("\n🔍 DATA QUALITY CHECK")
        print("-" * 30)

        # Missing values
        missing_count = col_data.isnull().sum()
        missing_percentage = (missing_count / len(col_data)) * 100
        print(f"Missing Values: {missing_count} ({missing_percentage:.1f}%)")

        # Check for inconsistencies (case sensitivity, whitespace, etc.)
        quality_issues = []

        # Case sensitivity check
        values_lower = [str(v).lower() for v in col_data.dropna().unique()]
        if len(values_lower) != len(set(values_lower)):
            quality_issues.append("Potential case sensitivity issues")

        # Whitespace check
        string_values = [str(v) for v in col_data.dropna().unique()]
        stripped_values = [v.strip() for v in string_values]
        if len(string_values) != len(set(stripped_values)):
            quality_issues.append("Potential whitespace issues")

        # Check for abbreviated vs full forms
        if any(len(str(v)) <= 3 for v in col_data.dropna().unique()):
            quality_issues.append("Mix of abbreviated and full forms detected")

        if quality_issues:
            print("Issues Found:")
            for issue in quality_issues:
                print(f"  ⚠️  {issue}")
        else:
            print("✅ No major quality issues detected")

        # 4. Class Imbalance Analysis
        print("\n⚖️  CLASS BALANCE ANALYSIS")
        print("-" * 30)

        # Check for imbalance
        min_percentage = value_percentages.min()
        max_percentage = value_percentages.max()
        imbalance_ratio = max_percentage / min_percentage

        print(f"Most Common: {max_percentage:.1f}%")
        print(f"Least Common: {min_percentage:.1f}%")
        print(f"Imbalance Ratio: {imbalance_ratio:.1f}:1")

        if imbalance_ratio > 10:
            balance_status = "Highly Imbalanced"
        elif imbalance_ratio > 5:
            balance_status = "Moderately Imbalanced"
        else:
            balance_status = "Reasonably Balanced"

        print(f"Status: {balance_status}")

        # 5. Preprocessing Recommendations
        print("\n🔧 PREPROCESSING RECOMMENDATIONS")
        print("-" * 30)

        recommendations = []

        # Encoding recommendations
        if unique_count == 2:
            recommendations.append("Consider Label Encoding (binary categorical)")
        elif unique_count <= 5:
            recommendations.append("Consider One-Hot Encoding (low cardinality)")
        else:
            recommendations.append("Consider Target Encoding or Embedding (high cardinality)")

        # Quality improvements
        if quality_issues:
            recommendations.append("Clean data inconsistencies before encoding")

        # Handle missing values
        if missing_percentage > 0:
            if missing_percentage < 5:
                recommendations.append("Consider mode imputation for missing values")
            else:
                recommendations.append("Investigate missing pattern, consider 'Unknown' category")

        # Handle imbalance
        if imbalance_ratio > 10:
            recommendations.append("(Imbalance) Consider grouping rare categories or stratified sampling")

        for i, rec in enumerate(recommendations, 1):
            print(f"{i}. {rec}")

        # 6. Visualization
        plt.figure(figsize=self.figsize)

        # Limit to top 15 categories for readability
        top_categories = value_counts.head(15)

        # Create bar plot
        bars = plt.bar(range(len(top_categories)), top_categories.values,
                      color=plt.cm.Set3(np.linspace(0, 1, len(top_categories))))

        plt.title(f'{column_name} - Value Distribution (Top 15)', fontsize=14, fontweight='bold')
        plt.xlabel('Categories')
        plt.ylabel('Count')
        plt.xticks(range(len(top_categories)), top_categories.index, rotation=45, ha='right')

        # Add value labels on bars
        for bar, count in zip(bars, top_categories.values):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(top_categories.values)*0.01,
                    f'{count:,}', ha='center', va='bottom', fontsize=9)

        plt.tight_layout()
        plt.show()

        # Return analysis results
        analysis_results = {
            'unique_count': unique_count,
            'missing_percentage': missing_percentage,
            'imbalance_ratio': imbalance_ratio,
            'quality_issues': quality_issues,
            'recommendations': recommendations,
            'value_counts': value_counts.to_dict()
        }

        return analysis_results
In [182]:
analyzer = UnivariateAnalyzer()

0. Product ID¶

In [183]:
classify_column(df['Product_Id'])
Out[183]:
'ID or Unique Identifier'

Since this is a nominal identifier, there is nothing to explore for it univariately.

1. Product Weight¶

In [184]:
# Nature of column
classify_column(df['Product_Weight'])
Out[184]:
'Continuous Numerical'
In [185]:
results = analyzer.analyze_numerical_column(df, 'Product_Weight')
📊 NUMERICAL ANALYSIS: Product_Weight
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
count    8763.000
mean       12.654
std         2.217
min         4.000
25%        11.150
50%        12.660
75%        14.180
max        22.000
Name: Product_Weight, dtype: float64

📋 DISTRIBUTION ANALYSIS
------------------------------
Skewness: 0.018
Interpretation: Approximately Normal
Kurtosis: 0.004
Interpretation: Normal-like tails

🔍 OUTLIER ANALYSIS
------------------------------
IQR Method:
  Lower Bound: 6.605
  Upper Bound: 18.725
  Outliers: 54 (0.6%)

❓ MISSING VALUES
------------------------------
Count: 0 (0.0%)

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Data appears clean, minimal preprocessing needed
In [186]:
pp.pprint(results)
{'kurtosis': 0.003881980226869697,
 'missing_percentage': 0.0,
 'outlier_percentage': 0.6162273194111606,
 'recommendations': ['Data appears clean, minimal preprocessing needed'],
 'skewness': 0.01751117204148118,
 'statistics': {'25%': 11.15,
                '50%': 12.66,
                '75%': 14.18,
                'count': 8763.0,
                'max': 22.0,
                'mean': 12.653792080337787,
                'min': 4.0,
                'std': 2.2173201376193634}}

🔍 Observations:

• Nearly perfect distribution: Skewness (0.02) and mean ≈ median indicate well-balanced, normal-like data distribution

• High data quality: Zero missing values and minimal outliers (0.6%) suggest clean, reliable weight measurements across products

• Preprocessing advantage: Excellent statistical properties mean minimal transformation needed

📌 NOTE: Scaling decision depends on final model choice

So what to do with Product_Weight?

No transformation → it's already symmetric and clean

✅ But scale it (e.g., StandardScaler) along with the other numeric features during preprocessing
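A minimal sketch of that scaling step, assuming pandas is available (toy weights, not the full column; the z-score below is exactly what `StandardScaler` computes):

```python
import pandas as pd

# Toy frame standing in for df['Product_Weight'] (illustrative values)
toy = pd.DataFrame({"Product_Weight": [4.0, 11.15, 12.66, 14.18, 22.0]})

# StandardScaler equivalent: z = (x - mean) / std, with population std (ddof=0)
col = toy["Product_Weight"]
toy["Product_Weight_scaled"] = (col - col.mean()) / col.std(ddof=0)

# Scaled column now has mean ≈ 0 and std = 1
print(toy)
```

In a real pipeline you would fit `sklearn.preprocessing.StandardScaler` on the training split and reuse the fitted scaler at inference time.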

2. Product Sugar Content¶

In [187]:
# Nature of column
classify_column(df['Product_Sugar_Content'])
Out[187]:
'Categorical'
In [188]:
results = analyzer.analyze_categorical_column(df, 'Product_Sugar_Content')
📊 CATEGORICAL ANALYSIS: Product_Sugar_Content
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
Total Values: 8,763
Unique Values: 4
Cardinality: Medium

📋 VALUE DISTRIBUTION
------------------------------
Top 10 Values:
 1. Low Sugar: 4,885 (55.7%)
 2. Regular: 2,251 (25.7%)
 3. No Sugar: 1,519 (17.3%)
 4. reg: 108 (1.2%)

🔍 DATA QUALITY CHECK
------------------------------
Missing Values: 0 (0.0%)
Issues Found:
  ⚠️  Mix of abbreviated and full forms detected

⚖️  CLASS BALANCE ANALYSIS
------------------------------
Most Common: 55.7%
Least Common: 1.2%
Imbalance Ratio: 45.2:1
Status: Highly Imbalanced

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider One-Hot Encoding (low cardinality)
2. Clean data inconsistencies before encoding
3. (Imbalance) Consider grouping rare categories or stratified sampling
In [189]:
pp.pprint(results)
{'imbalance_ratio': 45.23148148148148,
 'missing_percentage': 0.0,
 'quality_issues': ['Mix of abbreviated and full forms detected'],
 'recommendations': ['Consider One-Hot Encoding (low cardinality)',
                     'Clean data inconsistencies before encoding',
                     '(Imbalance) Consider grouping rare categories or '
                     'stratified sampling'],
 'unique_count': 4,
 'value_counts': {'Low Sugar': 4885,
                  'No Sugar': 1519,
                  'Regular': 2251,
                  'reg': 108}}

🔍 Observation: Product_Sugar_Content

  • Column captures sugar level of products with 4 unique categories.
  • The data contains inconsistent formats (e.g. 'Regular' vs 'reg') → needs cleaning before use.
  • Category distribution is highly imbalanced, with 'Low Sugar' dominating ~56% of entries.
  • Low cardinality - can be one-hot encoded after cleaning.

Solution Strategy:

  • NOT sampling (that's for classification targets)

Instead:

  • Merge rare categories (reg → Regular)
  • Domain-knowledge grouping if it makes business sense
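A minimal sketch of that merge, assuming pandas (toy values mirroring the 'reg' inconsistency, not the real counts):

```python
import pandas as pd

# Toy column with the same mixed labels seen in Product_Sugar_Content
s = pd.Series(["Low Sugar", "Regular", "reg", "No Sugar", "reg"])

# Fold the abbreviated form into its full label before any encoding
cleaned = s.replace({"reg": "Regular"})

print(cleaned.value_counts())
```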

3. Product Allocated Area¶

Description: share of display space allocated to this product among all products in the same store

In [190]:
# Nature of column
classify_column(df['Product_Allocated_Area'])
Out[190]:
'Continuous Numerical'
In [191]:
df['Product_Allocated_Area'].head()
Out[191]:
0    0.027
1    0.144
2    0.031
3    0.112
4    0.010
Name: Product_Allocated_Area, dtype: float64

📌 NOTE: It's a ratio, so values always lie between 0 and 1

In [192]:
results = analyzer.analyze_numerical_column(df, 'Product_Allocated_Area')
📊 NUMERICAL ANALYSIS: Product_Allocated_Area
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
count    8763.000
mean        0.069
std         0.048
min         0.004
25%         0.031
50%         0.056
75%         0.096
max         0.298
Name: Product_Allocated_Area, dtype: float64

📋 DISTRIBUTION ANALYSIS
------------------------------
Skewness: 1.128
Interpretation: Highly Skewed
Kurtosis: 1.292
Interpretation: Normal-like tails

🔍 OUTLIER ANALYSIS
------------------------------
IQR Method:
  Lower Bound: -0.067
  Upper Bound: 0.194
  Outliers: 104 (1.2%)

❓ MISSING VALUES
------------------------------
Count: 0 (0.0%)

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider Log transformation (right skewed)
2. Consider outlier treatment (moderate outliers)

⚡ Insights

  • There are a few outliers (tailing toward the upper end)
In [193]:
pp.pprint(results)
{'kurtosis': 1.2917138525421574,
 'missing_percentage': 0.0,
 'outlier_percentage': 1.1868081707177907,
 'recommendations': ['Consider Log transformation (right skewed)',
                     'Consider outlier treatment (moderate outliers)'],
 'skewness': 1.1279000327813162,
 'statistics': {'25%': 0.031,
                '50%': 0.056,
                '75%': 0.096,
                'count': 8763.0,
                'max': 0.298,
                'mean': 0.0687863745292708,
                'min': 0.004,
                'std': 0.04820376849944802}}

🔍 Observations

  • No missing values in this feature.
  • Distribution is right-skewed (skewness ≈ 1.13), i.e., most products get small shelf-space allocations, while a few premium products get significantly more display area
  • Business pattern: Typical product gets ~5.6% allocation, but outliers suggest strategic high-visibility placements for key products

⚠️ Preprocessing needed: Right skewness and moderate outliers indicate log/power transformation will improve model performance

Given its likely link to product visibility and sales, this feature is expected to be an important predictor.
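A minimal sketch of the recommended log transform, assuming pandas/numpy (toy ratios standing in for Product_Allocated_Area; `np.log1p` is used because the values sit near 0, where a plain log would explode):

```python
import numpy as np
import pandas as pd

# Toy right-skewed shelf-space ratios (illustrative, drawn from the summary stats above)
area = pd.Series([0.004, 0.031, 0.056, 0.096, 0.298])

# log1p(x) = log(1 + x): safe at 0, compresses the long right tail
area_log = np.log1p(area)

print(f"skewness before: {area.skew():.2f}, after: {area_log.skew():.2f}")
```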

4. Product Type¶

In [194]:
# Nature of column
classify_column(df['Product_Type'])
Out[194]:
'Categorical'
In [195]:
results = analyzer.analyze_categorical_column(df, 'Product_Type')
📊 CATEGORICAL ANALYSIS: Product_Type
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
Total Values: 8,763
Unique Values: 16
Cardinality: High

📋 VALUE DISTRIBUTION
------------------------------
Top 10 Values:
 1. Fruits and Vegetables: 1,249 (14.3%)
 2. Snack Foods: 1,149 (13.1%)
 3. Frozen Foods: 811 (9.3%)
 4. Dairy: 796 (9.1%)
 5. Household: 740 (8.4%)
 6. Baking Goods: 716 (8.2%)
 7. Canned: 677 (7.7%)
 8. Health and Hygiene: 628 (7.2%)
 9. Meat: 618 (7.1%)
10. Soft Drinks: 519 (5.9%)

🔍 DATA QUALITY CHECK
------------------------------
Missing Values: 0 (0.0%)
✅ No major quality issues detected

⚖️  CLASS BALANCE ANALYSIS
------------------------------
Most Common: 14.3%
Least Common: 0.9%
Imbalance Ratio: 16.4:1
Status: Highly Imbalanced

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider Target Encoding or Embedding (high cardinality)
2. (Imbalance) Consider grouping rare categories or stratified sampling
In [196]:
table_it(results['value_counts'], show_index=True)
  #  Key                      Value
---  ---------------------  -------
  1  Fruits and Vegetables     1249
  2  Snack Foods               1149
  3  Frozen Foods               811
  4  Dairy                      796
  5  Household                  740
  6  Baking Goods               716
  7  Canned                     677
  8  Health and Hygiene         628
  9  Meat                       618
 10  Soft Drinks                519
 11  Breads                     200
 12  Hard Drinks                186
 13  Others                     151
 14  Starchy Foods              141
 15  Breakfast                  106
 16  Seafood                     76
In [197]:
results
Out[197]:
{'unique_count': 16,
 'missing_percentage': 0.0,
 'imbalance_ratio': 16.43421052631579,
 'quality_issues': [],
 'recommendations': ['Consider Target Encoding or Embedding (high cardinality)',
  '(Imbalance) Consider grouping rare categories or stratified sampling'],
 'value_counts': {'Fruits and Vegetables': 1249,
  'Snack Foods': 1149,
  'Frozen Foods': 811,
  'Dairy': 796,
  'Household': 740,
  'Baking Goods': 716,
  'Canned': 677,
  'Health and Hygiene': 628,
  'Meat': 618,
  'Soft Drinks': 519,
  'Breads': 200,
  'Hard Drinks': 186,
  'Others': 151,
  'Starchy Foods': 141,
  'Breakfast': 106,
  'Seafood': 76}}

🔍 Observations Product_Type:

• Diverse product portfolio: 16 categories ranging from dominant fresh produce (1,249 items) to niche seafood (76 items) - reflects SuperKart's full supermarket range

• Clear category hierarchy: Fresh items (Fruits/Vegetables, Snacks) dominate inventory, while specialty items (Seafood, Breakfast) form smaller segments

• High cardinality challenge: 16 categories with a 16:1 imbalance ratio make plain One-Hot Encoding sparse - consider Target Encoding or grouping rare categories

• Business opportunity: Rare categories like Seafood might benefit from strategic expansion or could be grouped with similar categories for modeling

💡 Group Rare Categories (valid option): i.e., combine rare product types into an "Other" category (especially for OHE)

🧠 NOTE: One-Hot Encoding Problem:

Creates 16 binary columns:

  • Fruits_Vegetables: [1,0,0,0,0...] - 1,249 ones, 7,514 zeros
  • Snack_Foods: [0,1,0,0,0...] - 1,149 ones, 7,614 zeros
  • Seafood: [0,0,0,0,1...] - Only 76 ones, 8,687 zeros!

Issues:

  • Sparse matrix: Most values are zeros (wasteful)
  • Seafood column: 99% zeros → barely any signal for model to learn
  • 16 features: Increases dimensionality without much benefit

Target Encoding Solution:

Creates 1 numerical column based on average sales:

  • Fruits_Vegetables → 3,200 (avg sales)
  • Snack_Foods → 3,400 (avg sales)
  • Seafood → 2,800 (avg sales)
  • etc.

Benefits:

  • Dense information: Every value is meaningful
  • 1 feature: Compact representation
  • Handles rare categories: Even 76 Seafood samples contribute to meaningful average

Analogy:

One-Hot = Giving each person a separate yes/no checkbox
Target = Giving each person a single "performance score"

Target Encoding captures the relationship while staying compact!
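A minimal sketch of target encoding, assuming pandas (toy sales figures, not the dataset's; in practice the category means are fit on the training split only, to avoid target leakage):

```python
import pandas as pd

# Toy rows; sales values are illustrative
toy = pd.DataFrame({
    "Product_Type": ["Seafood", "Seafood", "Dairy", "Dairy", "Dairy"],
    "Sales":        [2700,      2900,      3500,    3600,    3490],
})

# Target encoding: replace each category with its mean target value
means = toy.groupby("Product_Type")["Sales"].mean()
toy["Product_Type_te"] = toy["Product_Type"].map(means)

print(toy)
```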

Rule of Thumb

  1. Linear Models: OHE preferred (avoids false ordinal)

  2. Tree Models: Either works (trees find splits, don't assume order)

  • target encoding can mislead linear models into thinking categories have numerical relationships when they don't!

So for Product_Type: OHE after grouping rare categories is the safer approach.

Linear models are associated with OHE because of how they handle coefficients.

  • Each category gets independent treatment - no forced linear relationships between unrelated categories.
  • That's why linear models prefer OHE - it respects the categorical nature of the data.
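A minimal sketch of "group rare categories, then one-hot encode", assuming pandas (the 20% frequency threshold and the toy counts are illustrative):

```python
import pandas as pd

# Toy Product_Type column with two common and two rare categories
s = pd.Series(["Dairy"] * 4 + ["Snack Foods"] * 3 + ["Seafood"] + ["Breakfast"])

# Keep categories above an illustrative 20% frequency; fold the rest into 'Other'
freq = s.value_counts(normalize=True)
keep = freq[freq >= 0.20].index
grouped = s.where(s.isin(keep), "Other")

# One-hot encode the grouped column
dummies = pd.get_dummies(grouped, prefix="Product_Type")
print(dummies.columns.tolist())
```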

5. Product MRP¶

In [198]:
# nature of column
classify_column(df['Product_MRP'])
Out[198]:
'Continuous Numerical'
In [199]:
results = analyzer.analyze_numerical_column(df, 'Product_MRP')
📊 NUMERICAL ANALYSIS: Product_MRP
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
count    8763.000
mean      147.033
std        30.694
min        31.000
25%       126.160
50%       146.740
75%       167.585
max       266.000
Name: Product_MRP, dtype: float64

📋 DISTRIBUTION ANALYSIS
------------------------------
Skewness: 0.037
Interpretation: Approximately Normal
Kurtosis: -0.012
Interpretation: Normal-like tails

🔍 OUTLIER ANALYSIS
------------------------------
IQR Method:
  Lower Bound: 64.022
  Upper Bound: 229.723
  Outliers: 57 (0.7%)

❓ MISSING VALUES
------------------------------
Count: 0 (0.0%)

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Data appears clean, minimal preprocessing needed
In [200]:
pp.pprint(results)
{'kurtosis': -0.01226820030102882,
 'missing_percentage': 0.0,
 'outlier_percentage': 0.6504621704895585,
 'recommendations': ['Data appears clean, minimal preprocessing needed'],
 'skewness': 0.03650675645590474,
 'statistics': {'25%': 126.16,
                '50%': 146.74,
                '75%': 167.585,
                'count': 8763.0,
                'max': 266.0,
                'mean': 147.03253908478834,
                'min': 31.0,
                'std': 30.69411031174993}}

⚡ Insights

  • The histogram appears approximately normal

  • The distribution is fairly symmetric (low skewness) and normal-like, indicating good spread across price ranges.

  • Excellent data quality: No missing values and minimal outliers (0.65%) suggest consistent, reliable pricing data collection

  • Business insight: Price range (₹31-₹266) with ₹147 average reflects diverse product portfolio from budget to premium segments

  • Preprocessing advantage: Clean statistical properties require minimal transformation - can use raw MRP values directly in modeling

💡 NOTE: ✅ Even clean numeric features like this should still be scaled (e.g., StandardScaler) when using linear models

6. Store ID¶

In [201]:
# nature of column
classify_column(df['Store_Id'])
Out[201]:
'Categorical'
In [202]:
analyzer.analyze_categorical_column(df, 'Store_Id')
📊 CATEGORICAL ANALYSIS: Store_Id
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
Total Values: 8,763
Unique Values: 4
Cardinality: Medium

📋 VALUE DISTRIBUTION
------------------------------
Top 10 Values:
 1. OUT004: 4,676 (53.4%)
 2. OUT001: 1,586 (18.1%)
 3. OUT003: 1,349 (15.4%)
 4. OUT002: 1,152 (13.1%)

🔍 DATA QUALITY CHECK
------------------------------
Missing Values: 0 (0.0%)
✅ No major quality issues detected

⚖️  CLASS BALANCE ANALYSIS
------------------------------
Most Common: 53.4%
Least Common: 13.1%
Imbalance Ratio: 4.1:1
Status: Reasonably Balanced

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider One-Hot Encoding (low cardinality)
Out[202]:
{'unique_count': 4,
 'missing_percentage': 0.0,
 'imbalance_ratio': 4.059027777777777,
 'quality_issues': [],
 'recommendations': ['Consider One-Hot Encoding (low cardinality)'],
 'value_counts': {'OUT004': 4676,
  'OUT001': 1586,
  'OUT003': 1349,
  'OUT002': 1152}}

⚡ Insights

  • Store_Id is a nominal identifier - it doesn't carry inherent predictive value.

  • Including it would cause the model to memorize store-specific patterns, which hurts generalization to unseen data or stores.

  • In real-world deployment, we can't rely on store-specific IDs for prediction - we want the model to learn from store attributes like type, size, and location.

🔍 Observations

  • Uneven store representation: OUT004 dominates dataset (53% of records) while OUT002 has least data (13%) - reflects different store sizes or product ranges

  • Moderate imbalance: 4:1 ratio suggests some stores contribute significantly more product-store combinations than others.

  • EDA insight: Store representation differences may reflect business reality (flagship vs smaller outlets) rather than data collection bias

✅ Drop Store_Id - it adds noise and risks overfitting

7. Store Establishment Year¶

In [203]:
# nature of column
classify_column(df['Store_Establishment_Year'])
Out[203]:
'Ordinal or Discrete (Numerical)'
In [204]:
results = analyzer.analyze_categorical_column(df, 'Store_Establishment_Year')
📊 CATEGORICAL ANALYSIS: Store_Establishment_Year
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
Total Values: 8,763
Unique Values: 4
Cardinality: Medium

📋 VALUE DISTRIBUTION
------------------------------
Top 10 Values:
 1. 2009: 4,676 (53.4%)
 2. 1987: 1,586 (18.1%)
 3. 1999: 1,349 (15.4%)
 4. 1998: 1,152 (13.1%)

🔍 DATA QUALITY CHECK
------------------------------
Missing Values: 0 (0.0%)
✅ No major quality issues detected

⚖️  CLASS BALANCE ANALYSIS
------------------------------
Most Common: 53.4%
Least Common: 13.1%
Imbalance Ratio: 4.1:1
Status: Reasonably Balanced

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider One-Hot Encoding (low cardinality)

⚡ Insights

  • As we can see, there are only 4 distinct values (although it's numeric in nature)
  • Also, having so few unique values means it may behave like a quasi-categorical column anyway, offering limited numeric variance.

If treated as a numeric column -

  • In regression (especially linear), it assumes linear relationships - e.g., newer stores have proportionally more or less sales - which may or may not be valid.

📌 Risk: It introduces a false sense of continuity or ordering if the relationship is not actually linear.

🧠 Domain Know-How

Business Context Consideration:

What really matters for sales ?

  • Store Age (maturity, customer loyalty, infrastructure condition)
  • Not the specific year it was established

Recommended Approach: (Feat Eng)

  • Transform to Store_Age (ie derived feature)
  • Few age buckets:
    • Bin it into categories
    • e.g. "New" (0-5 years), "Mature" (6-15 years), "Established" (16+ years) → Categorical
  • Continuous age: Treat as numeric if linear relationship makes sense

Store_Age transformation captures the business logic better than raw establishment year.
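A minimal sketch of that Store_Age derivation, assuming pandas (the reference year 2011 and the bucket edges are illustrative assumptions, not from the dataset):

```python
import pandas as pd

# The four establishment years present in the data
years = pd.Series([2009, 1987, 1999, 1998])

REFERENCE_YEAR = 2011  # assumed snapshot year, for illustration only
store_age = REFERENCE_YEAR - years

# Bucket ages into the categories suggested above
age_bucket = pd.cut(
    store_age,
    bins=[0, 5, 15, 200],
    labels=["New", "Mature", "Established"],
)
print(pd.DataFrame({"year": years, "age": store_age, "bucket": age_bucket}))
```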

8. Store Size¶

In [205]:
# nature of column
classify_column(df['Store_Size'])
Out[205]:
'Categorical'
In [206]:
results = analyzer.analyze_categorical_column(df, 'Store_Size')
📊 CATEGORICAL ANALYSIS: Store_Size
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
Total Values: 8,763
Unique Values: 3
Cardinality: Medium

📋 VALUE DISTRIBUTION
------------------------------
Top 10 Values:
 1. Medium: 6,025 (68.8%)
 2. High: 1,586 (18.1%)
 3. Small: 1,152 (13.1%)

🔍 DATA QUALITY CHECK
------------------------------
Missing Values: 0 (0.0%)
✅ No major quality issues detected

⚖️  CLASS BALANCE ANALYSIS
------------------------------
Most Common: 68.8%
Least Common: 13.1%
Imbalance Ratio: 5.2:1
Status: Moderately Imbalanced

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider One-Hot Encoding (low cardinality)
In [207]:
pp.pprint(results)
{'imbalance_ratio': 5.230034722222221,
 'missing_percentage': 0.0,
 'quality_issues': [],
 'recommendations': ['Consider One-Hot Encoding (low cardinality)'],
 'unique_count': 3,
 'value_counts': {'High': 1586, 'Medium': 6025, 'Small': 1152}}

⚡ Insights

  • Medium stores dominate: 69% of product-store combinations come from medium-sized stores, suggesting SuperKart's primary business model focuses on mid-tier outlets

  • Balanced representation: All three size categories have substantial data (1,000+ records each) ensuring reliable statistical analysis for each store type.

  • Natural ordering present: Small < Medium < High represents clear business hierarchy that should be preserved in modeling

Prefer Ordinal Encoding for this column!
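A minimal sketch of that ordinal encoding, assuming pandas (the integer values are an assumption; any monotone mapping preserving Small < Medium < High works):

```python
import pandas as pd

# Explicit map preserving the Small < Medium < High order
size_order = {"Small": 0, "Medium": 1, "High": 2}

s = pd.Series(["Medium", "High", "Small", "Medium"])
encoded = s.map(size_order)

print(encoded.tolist())
```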

9. Store Location City Type¶

In [208]:
# nature of column
classify_column(df['Store_Location_City_Type'])
Out[208]:
'Categorical'
In [209]:
results = analyzer.analyze_categorical_column(df, 'Store_Location_City_Type')
📊 CATEGORICAL ANALYSIS: Store_Location_City_Type
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
Total Values: 8,763
Unique Values: 3
Cardinality: Medium

📋 VALUE DISTRIBUTION
------------------------------
Top 10 Values:
 1. Tier 2: 6,262 (71.5%)
 2. Tier 1: 1,349 (15.4%)
 3. Tier 3: 1,152 (13.1%)

🔍 DATA QUALITY CHECK
------------------------------
Missing Values: 0 (0.0%)
✅ No major quality issues detected

⚖️  CLASS BALANCE ANALYSIS
------------------------------
Most Common: 71.5%
Least Common: 13.1%
Imbalance Ratio: 5.4:1
Status: Moderately Imbalanced

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider One-Hot Encoding (low cardinality)
In [210]:
results
Out[210]:
{'unique_count': 3,
 'missing_percentage': 0.0,
 'imbalance_ratio': 5.4357638888888875,
 'quality_issues': [],
 'recommendations': ['Consider One-Hot Encoding (low cardinality)'],
 'value_counts': {'Tier 2': 6262, 'Tier 1': 1349, 'Tier 3': 1152}}

⚡ Insights

  • Tier 2 city focus: 71% of SuperKart's operations concentrate in Tier 2 cities, indicating strategic positioning in mid-tier urban markets

  • Balanced tier representation: Both Tier 1 and Tier 3 cities have similar presence (15% and 13% respectively), suggesting comprehensive market coverage

  • Natural city hierarchy: Tier 1 > Tier 2 > Tier 3 represents clear economic progression that should be preserved in modeling

  • Strategic insight: Heavy Tier 2 presence suggests SuperKart targets growing urban centers rather than just premium or budget markets

Prefer Ordinal Encoding for this column!
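The same idea via an ordered pandas Categorical, where the category position doubles as the ordinal code (ranking Tier 3 lowest follows the economic hierarchy noted above; the direction is a modeling assumption):

```python
import pandas as pd

# Ordered Categorical: .codes gives the position in the stated order
tiers = pd.Series(["Tier 2", "Tier 1", "Tier 3", "Tier 2"])
ordered = pd.Categorical(tiers, categories=["Tier 3", "Tier 2", "Tier 1"], ordered=True)

codes = pd.Series(ordered.codes)  # Tier 3 → 0, Tier 2 → 1, Tier 1 → 2
print(codes.tolist())
```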

10. Store Type¶

In [211]:
# nature
classify_column(df['Store_Type'])
Out[211]:
'Categorical'
In [212]:
results = analyzer.analyze_categorical_column(df, 'Store_Type')
📊 CATEGORICAL ANALYSIS: Store_Type
==================================================
📈 DESCRIPTIVE STATISTICS
------------------------------
Total Values: 8,763
Unique Values: 4
Cardinality: Medium

📋 VALUE DISTRIBUTION
------------------------------
Top 10 Values:
 1. Supermarket Type2: 4,676 (53.4%)
 2. Supermarket Type1: 1,586 (18.1%)
 3. Departmental Store: 1,349 (15.4%)
 4. Food Mart: 1,152 (13.1%)

🔍 DATA QUALITY CHECK
------------------------------
Missing Values: 0 (0.0%)
✅ No major quality issues detected

⚖️  CLASS BALANCE ANALYSIS
------------------------------
Most Common: 53.4%
Least Common: 13.1%
Imbalance Ratio: 4.1:1
Status: Reasonably Balanced

🔧 PREPROCESSING RECOMMENDATIONS
------------------------------
1. Consider One-Hot Encoding (low cardinality)
In [213]:
pp.pprint(results)
{'imbalance_ratio': 4.059027777777777,
 'missing_percentage': 0.0,
 'quality_issues': [],
 'recommendations': ['Consider One-Hot Encoding (low cardinality)'],
 'unique_count': 4,
 'value_counts': {'Departmental Store': 1349,
                  'Food Mart': 1152,
                  'Supermarket Type1': 1586,
                  'Supermarket Type2': 4676}}

⚡ Insights

  • This is a categorical (nominal) variable with 4 unique types.

  • Supermarket Type2 dominance: 53% of operations are Type2 format, indicating this is SuperKart's primary business model across locations

  • Other formats still have 1,000+ samples each, ensuring reasonably reliable analysis.

One-Hot Encoding is suitable here


BiVariate/MultiVariate Analysis¶

Helpers (python)¶

In [214]:
class EDAHelper:
    """
    Helper class for common EDA visualizations
    """

    @staticmethod
    def plot_scatter_with_trend(df, x_col_name, y_col_name, x_label, y_label, title=None, color='skyblue', show_it=False):
        """
        Create scatter plot with trend line

        Args:
            df: DataFrame
            x_col_name: x-axis column name
            y_col_name: y-axis column name
            x_label: x-axis label
            y_label: y-axis label
            title: plot title (optional)
            color: scatter plot color
            show_it: whether to show the plot

        Returns:
            correlation coefficient
        """

        x_col = df[x_col_name]
        y_col = df[y_col_name]

        # Scatter plot
        plt.scatter(x_col, y_col, alpha=0.6, color=color)

        # Trend line
        z = np.polyfit(x_col, y_col, 1)
        p = np.poly1d(z)
        plt.plot(x_col, p(x_col), "r--", alpha=0.8)

        # Labels and formatting
        plt.xlabel(x_label)
        plt.ylabel(y_label)
        plt.title(title or f'{x_label} vs {y_label}')
        plt.grid(True, alpha=0.3)

        # Correlation
        correlation = x_col.corr(y_col)
        plt.text(0.05, 0.95, f'Correlation: {correlation:.3f}',
                transform=plt.gca().transAxes, bbox=dict(boxstyle="round", facecolor='white'))

        if show_it:
            plt.tight_layout()
            plt.show()

        return correlation

    @staticmethod
    def plot_binned_analysis(df, x_col_name, y_col_name, x_label, y_label, bins=5, labels=None, title=None, color='lightcoral', show_it=False):
        """
        Create binned analysis bar plot

        Args:
            df: DataFrame
            x_col_name: x-axis column name
            y_col_name: y-axis column name
            x_label: x-axis label
            y_label: y-axis label
            bins: number of bins or custom bin edges
            labels: custom labels for bins
            title: plot title
            color: bar color
            show_it: whether to show the plot

        Returns:
            binned summary series
        """

        # plt.figure(figsize=figsize)
        x_col = df[x_col_name]

        # Create bins
        if labels is None:
            # bins may be an int or a sequence of edges; derive the bin count accordingly
            n_bins = bins if isinstance(bins, int) else len(bins) - 1
            labels = [f'Bin_{i+1}' for i in range(n_bins)]

        feat_bins = pd.cut(x_col, bins=bins, labels=labels)
        binned_summary = df.groupby(feat_bins)[y_col_name].mean()

        # Bar plot
        bars = plt.bar(binned_summary.index, binned_summary.values, color=color)
        plt.xlabel(x_label + ' Categories')
        plt.ylabel(f'Average {y_label}')
        plt.title(title or f'Average {y_label} by {x_label} Bins')
        plt.xticks(rotation=45)

        # Add value labels on bars
        for bar, value in zip(bars, binned_summary.values):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(binned_summary.values)*0.01,
                    f'{value:.0f}', ha='center', va='bottom')
        if show_it:
            plt.tight_layout()
            plt.show()

        return binned_summary
In [215]:
eda_helper = EDAHelper()

Target Relationship Analysis

1. MRP vs Sales¶

🔭 Possible outcome

  • Price-sales elasticity, premium vs budget performance
In [216]:
# Product_MRP vs Sales
def analyze_mrp_vs_sales(df):
    """
    Analyze the relationship between Product MRP and Total Sales.

    Plots:
        1. Scatter plot with trend line.
        2. Bar plot of average sales per MRP bin.

    Returns:
        float: Correlation coefficient.
        pd.Series: Average sales per MRP bin.
    """
    # plt.figure(figsize=(12, 6))

    mrp_col = "Product_MRP"
    sales_col = "Product_Store_Sales_Total"

    x_label = "MRP"
    y_label = "Sales"

    # Scatter plot with trend line
    plt.subplot(1, 2, 1)
    correlation = eda_helper.plot_scatter_with_trend(
        df, mrp_col, sales_col, x_label, y_label
    )

    # Binned analysis
    plt.subplot(1, 2, 2)
    binned_summary = eda_helper.plot_binned_analysis(
        df,
        mrp_col,
        sales_col,
        x_label,
        y_label,
        bins=5,
        labels=["Very Low", "Low", "Medium", "High", "Very High"],
    )

    plt.tight_layout()
    plt.show()

    return correlation, binned_summary
In [217]:
# Run analysis
correlation, mrp_summary = analyze_mrp_vs_sales(df)
print(f"Correlation coefficient: {correlation:.3f}")
print("\nAverage Sales by MRP Range:")
print(mrp_summary.round(0))
Correlation coefficient: 0.788

Average Sales by MRP Range:
Product_MRP
Very Low     1272.0
Low          2445.0
Medium       3493.0
High         4570.0
Very High    5808.0
Name: Product_Store_Sales_Total, dtype: float64

⚡ Insights

  • Strong positive relationship: Correlation of 0.788 indicates higher-priced products consistently generate more sales revenue

  • Clear pricing hierarchy: Sales more than quadruple from very low (₹1,272) to very high (₹5,808) MRP ranges, suggesting effective premium pricing strategy

  • Linear trend confirmed: Consistent step-up across all price categories validates MRP as a strong predictor for sales forecasting

  • Business insight: SuperKart's pricing strategy aligns well with market demand - premium products deliver proportionally higher returns

2. Product Allocation Area vs Sales¶

🔭 Possible outcome

  • Shelf space ROI, optimal allocation insights
In [218]:
# Product_Allocated_Area vs Sales
def analyze_area_vs_sales(df):
    """
    Analyze relationship between Product Allocated Area and Sales
    """
    x_col_name = "Product_Allocated_Area"
    y_col_name = "Product_Store_Sales_Total"
    x_label = "Allocated Area"
    y_label = "Sales"

    # Correlation coefficient
    plt.subplot(1, 2, 1)
    correlation = eda_helper.plot_scatter_with_trend(
        df, x_col_name, y_col_name, x_label, y_label
    )

    # Binned analysis
    plt.subplot(1, 2, 2)
    binned_summary = eda_helper.plot_binned_analysis(
        df,
        x_col_name,
        y_col_name,
        x_label,
        y_label,
        bins=5,
        labels=["Very Small", "Small", "Medium", "Large", "Very Large"],
    )

    plt.tight_layout()
    plt.show()

    return correlation, binned_summary
In [219]:
# Run analysis
correlation, area_summary = analyze_area_vs_sales(df)
In [220]:
print(f"Correlation coefficient: {correlation:.3f}")
print("\nAverage Sales by Allocated Area Range:")
print(area_summary.round(0))
Correlation coefficient: -0.001

Average Sales by Allocated Area Range:
Product_Allocated_Area
Very Small    3458.0
Small         3498.0
Medium        3412.0
Large         3454.0
Very Large    3475.0
Name: Product_Store_Sales_Total, dtype: float64

⚡ Insights

  • No relationship found: Near-zero correlation (-0.001) indicates shelf space allocation doesn't drive sales performance

  • All allocation categories show similar sales (₹3,412-₹3,498), suggesting current allocation strategy is already optimized

  • Contradicts retail assumptions: Unlike typical retail wisdom, more display area doesn't translate to higher sales at SuperKart

Product_Allocated_Area may be less predictive than expected

3. Product Type vs Sales¶

🔭 Possible outcome

  • Category performance ranking
  • (validates grouping decisions, as a part of Feature Engineering)
In [221]:
sns.boxplot(data=df, x='Product_Type', y='Product_Store_Sales_Total')
plt.xticks(rotation=45)
plt.title('Sales Distribution by Product Type')
plt.xlabel('Product Type')
plt.ylabel('Sales Total (₹)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
In [222]:
avg_sales = df.groupby('Product_Type')['Product_Store_Sales_Total'].mean().sort_values(ascending=False)

sns.barplot(x=avg_sales.index, y=avg_sales.values)
plt.xticks(rotation=45)
plt.title('Average Sales by Product Type')
plt.xlabel('Product Type')
plt.ylabel('Average Sales (₹)')

# Add value labels
for index, value in enumerate(avg_sales.values):
    plt.text(index, value + 50, f'{value:.0f}', ha='center', va='bottom')

plt.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
In [223]:
avg_sales
Out[223]:
Product_Type
Starchy Foods            3679.251418
Others                   3586.068212
Seafood                  3584.263684
Breads                   3574.711200
Dairy                    3532.560352
Snack Foods              3471.711880
Household                3465.865095
Frozen Foods             3464.834562
Soft Drinks              3462.513911
Health and Hygiene       3445.393646
Meat                     3445.326764
Fruits and Vegetables    3443.421353
Baking Goods             3425.958101
Breakfast                3416.324623
Canned                   3397.463383
Hard Drinks              3364.594731
Name: Product_Store_Sales_Total, dtype: float64

⚡ Insights

  • Narrow performance range: All categories perform within ₹315 range (₹3,365-₹3,679), indicating product type has modest impact on sales compared to MRP

  • Staple foods lead: Top performers (Starchy Foods, Others, Seafood, Breads, Dairy) suggest essential/cooking ingredients drive higher sales

  • Beverages underperform: Hard Drinks rank lowest while Soft Drinks are mid-tier, indicating different consumption patterns across drink categories

Grouping validation: 🤔 Small performance differences between many categories statistically support combining rare categories during preprocessing

4. Store Size vs Sales¶

🔭 Possible outcome

  • Size-sales relationship, scale benefits
In [224]:
# Box Plot
sns.boxplot(data=df, x='Store_Size', y='Product_Store_Sales_Total')
plt.title('Sales Distribution across Store Sizes')
plt.xlabel('Store Size')
plt.ylabel('Sales Total (₹)')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
In [225]:
avg_sales = df.groupby('Store_Size')['Product_Store_Sales_Total'].mean().sort_values(ascending=False)

sns.barplot(x=avg_sales.index, y=avg_sales.values)
plt.xticks(rotation=45)
plt.title('Average Sales by Store Size')
plt.xlabel('Store Size')
plt.ylabel('Average Sales (₹)')

# Add value labels
for index, value in enumerate(avg_sales.values):
    plt.text(index, value + 50, f'{value:.0f}', ha='center', va='bottom')

plt.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
In [226]:
avg_sales
Out[226]:
Store_Size
High      3923.778802
Medium    3668.222573
Small     1762.942465
Name: Product_Store_Sales_Total, dtype: float64

⚡ Insights

  • Strong size advantage: High stores generate 2.2x the sales of Small stores (₹3,924 vs ₹1,763), confirming clear economies of scale

  • Small store challenge: Dramatic performance gap between Small (₹1,763) and Medium (₹3,668) suggests different business models rather than gradual scaling

  • Business insight: Store size is a strong predictor - expansion to larger formats could significantly boost SuperKart's revenue per location

5. Store City vs Sales¶

🔭 Possible outcome

  • Tier city performance differences
In [227]:
# Box Plot
sns.boxplot(data=df, x='Store_Location_City_Type', y='Product_Store_Sales_Total')
plt.title('Sales Distribution across Store Locations')
plt.xlabel('Store Location')
plt.ylabel('Sales Total (₹)')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
In [228]:
# Order by logical tier progression
tier_order = ['Tier 1', 'Tier 2', 'Tier 3']
tier_sales = df.groupby('Store_Location_City_Type')['Product_Store_Sales_Total'].mean().reindex(tier_order)

sns.barplot(x=tier_sales.index, y=tier_sales.values)
plt.xticks(rotation=45)
plt.title('Average Sales by Store Location')
plt.xlabel('Store Location')
plt.ylabel('Average Sales (₹)')

# Add value labels
for index, value in enumerate(tier_sales.values):
    plt.text(index, value + 50, f'{value:.0f}', ha='center', va='bottom')

plt.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
In [229]:
tier_sales
Out[229]:
Store_Location_City_Type
Tier 1    4946.966323
Tier 2    3457.473109
Tier 3    1762.942465
Name: Product_Store_Sales_Total, dtype: float64
In [230]:
# Get tier cities frequencies in pct
tier_cities_freq = df['Store_Location_City_Type'].value_counts(normalize=True)
tier_cities_freq
Out[230]:
Store_Location_City_Type
Tier 2    0.714595
Tier 1    0.153943
Tier 3    0.131462
Name: proportion, dtype: float64

⚡ Insights

  • Clear tier hierarchy: Tier 1 cities deliver 43% higher sales than Tier 2 (₹4,947 vs ₹3,457), confirming economic tier advantages

  • Quality over quantity: 🧐 Despite Tier 2 cities having 71% of SuperKart's locations, Tier 1 stores significantly outperform - showing fewer premium locations beat many average ones.

  • Significant market gaps: 2.8x performance difference between Tier 1 and Tier 3 cities indicates distinct market dynamics rather than gradual scaling

  • Strategic insight: Tier 1 focus could maximize revenue, while Tier 3 performance matches Small stores suggesting similar constraints

Multivariate Analysis¶

6. Product Type, Store Size & Sales¶

  • helps identify which product types sell better in which store sizes.
In [231]:
# Pivot table: Average sales for each (Product_Type, Store_Size) combination
pivot = df.pivot_table(index='Product_Type',
                       columns='Store_Size',
                       values='Product_Store_Sales_Total',
                       aggfunc='mean')

sns.heatmap(pivot, annot=True, fmt=".0f", cmap="YlGnBu")
plt.title("Average Sales by Product Type and Store Size")
plt.xlabel("Store Size")
plt.ylabel("Product Type")
plt.tight_layout()
plt.show()
In [232]:
pivot
Out[232]:
Store_Size High Medium Small
Product_Type
Baking Goods 3861.257647 3632.219959 1769.380208
Breads 4042.469667 3743.188299 1887.803043
Breakfast 3816.110000 3710.780370 1559.740000
Canned 3773.246891 3616.167383 1721.223409
Dairy 3991.784133 3753.251365 1720.078654
Frozen Foods 3933.498662 3646.352236 1785.108416
Fruits and Vegetables 3984.887387 3638.704218 1776.806905
Hard Drinks 4024.230000 3547.559576 1809.395000
Health and Hygiene 3815.836053 3697.496667 1809.459451
Household 3965.458060 3653.563518 1846.656500
Meat 3891.286769 3669.687406 1744.827701
Others 3999.260968 3808.747327 1728.196316
Seafood 4072.064615 3807.619811 1766.335000
Snack Foods 3990.803168 3654.852859 1748.750479
Soft Drinks 3873.100849 3654.380855 1674.328226
Starchy Foods 3763.874375 3899.850412 1670.415000
In [233]:
avg_sales
Out[233]:
Store_Size
High      3923.778802
Medium    3668.222573
Small     1762.942465
Name: Product_Store_Sales_Total, dtype: float64

⚡ Insights

  • Larger Stores Drive Higher Sales Across All Product Types
  • Starchy Foods Stand Out in Medium Stores
  • Breakfast sees a steep drop in average sales in Small stores, suggesting these may be space-sensitive or impulse-driven products that sell more effectively in larger formats.

7. Product Type, Store Type & Sales¶

  • Which products thrive in which store formats
In [234]:
# Pivot table for average sales per Product_Type and Store_Type
pivot = df.pivot_table(
    index='Product_Type',
    columns='Store_Type',
    values='Product_Store_Sales_Total',
    aggfunc='mean'
)
In [235]:
# Plot: Grouped bar
pivot.plot(kind='bar', figsize=(14, 6), width=0.8, colormap='crest')

plt.title('Average Sales by Product Type across Store Types')
plt.ylabel('Average Sales (₹)')
plt.xlabel('Product Type')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.legend(title='Store Type')
plt.tight_layout()
plt.show()
In [236]:
# Same pivot as above
sns.heatmap(pivot, annot=True, fmt=".0f", cmap='flare', linewidths=0.5)

plt.title('Heatmap of Average Sales: Product vs Store Type')
plt.xlabel('Store Type')
plt.ylabel('Product Type')
plt.tight_layout()
plt.show()
In [237]:
# Summary statistics
print("Top 3 combinations by sales:")
combinations = df.groupby(['Product_Type', 'Store_Type'])['Product_Store_Sales_Total'].mean().sort_values(ascending=False)
print(combinations.head(3).round(0))
Top 3 combinations by sales:
Product_Type   Store_Type        
Breads         Departmental Store    5159.0
Starchy Foods  Departmental Store    5126.0
Breakfast      Departmental Store    5033.0
Name: Product_Store_Sales_Total, dtype: float64
In [238]:
pivot
Out[238]:
Store_Type Departmental Store Food Mart Supermarket Type1 Supermarket Type2
Product_Type
Baking Goods 4968.769697 1769.380208 3861.257647 3288.535740
Breads 5158.586176 1887.803043 4042.469667 3317.316372
Breakfast 5033.372632 1559.740000 3816.110000 3305.469839
Canned 5027.168556 1721.223409 3773.246891 3281.982895
Dairy 4936.654759 1720.078654 3991.784133 3321.025945
Frozen Foods 4898.429672 1785.108416 3933.498662 3303.855717
Fruits and Vegetables 4930.975055 1776.806905 3984.887387 3302.713800
Hard Drinks 4815.665217 1809.395000 4024.230000 3240.544526
Health and Hygiene 4934.148090 1809.459451 3815.836053 3367.969790
Household 4897.024673 1846.656500 3965.458060 3320.104010
Meat 4914.525283 1744.827701 3891.286769 3222.389729
Others 4998.867188 1728.196316 3999.260968 3256.807681
Seafood 5025.960000 1766.335000 4072.064615 3411.659250
Snack Foods 4938.228172 1748.750479 3990.803168 3266.710081
Soft Drinks 4933.058108 1674.328226 3873.100849 3312.784765
Starchy Foods 5126.378571 1670.415000 3763.874375 3402.128841

⚡ Insights

  • Departmental Store dominance: Departmental Stores dominate sales across all product types, indicating stronger customer engagement or product availability.

  • Staple foods excel in premium formats: Breads, Starchy Foods, and Breakfast items perform exceptionally well in Departmental Stores, likely driven by routine or impulse buys; essential items thrive in full-service environments

  • Food Mart underperformance: Food Marts lag behind in sales for most categories, suggesting possible limitations in reach, assortment, or store scale.

  • Strategic insight: Expanding Departmental Store format could boost performance across entire product portfolio rather than optimizing specific product-store matches

  • Format-agnostic performance: Similar relative product rankings across store types suggest product characteristics matter more than format-specific synergies


Preprocessing¶


REMEMBER 🧠¶

🎯 Decision Flow

  1. Data Type? → Numeric: Scale → Categorical: Encode
  2. Missing Values? → Impute
  3. Outliers? → Handle (remove/clip/robust methods)
  4. Feature Count? → High: Select → Low: Keep all
  5. Interpretability? → High: Linear/Ridge → Low: Any regularization
  6. Overfitting? → Yes: Increase regularization → No: Decrease
  7. Performance? → Good: Done → Poor: Feature engineering

🔗 Pipeline Order

  1. Handle missing values (imputation)
  2. Transform (log, box-cox, etc.)
  3. Scale (StandardScaler, etc.)
  4. Feature selection (if needed)
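The pipeline order above can be sketched with scikit-learn. This is a minimal illustration on toy data (the column values are made up, not SuperKart data); the log transform stands in for step 2.

```python
# Sketch of the order: impute -> transform -> scale, chained in one Pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

numeric_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # 1. handle missing values
    ("log", FunctionTransformer(np.log1p)),        # 2. transform (fix skew)
    ("scale", StandardScaler()),                   # 3. scale
])

X = np.array([[100.0], [150.0], [np.nan], [200.0]])  # toy feature with a gap
X_out = numeric_steps.fit_transform(X)
print(X_out.round(2))
```

Because the imputer runs first, the log and scaling steps never see a NaN.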

Transformation vs Scaling - Key Distinction:

Transformation = Changes distribution shape

  • Log transform (for skewed data)
  • Square root, polynomial transforms
  • Purpose: Fix skewness, non-linearity

Scaling = Changes range/magnitude but preserves shape

  • StandardScaler, MinMaxScaler, RobustScaler
  • Purpose: Put features on similar scales

Simple Example:

Product_Weight: [8, 12, 16] kg

Product_MRP: [100, 150, 200] rupees

Without scaling: Linear regression might favor MRP (larger numbers)

With scaling: Both features get equal importance consideration

MinMaxScaler maintains relative distances (1 → 2 → 3 becomes 0 → 0.5 → 1)

StandardScaler maintains order too (but re-centers and re-spreads)

➡️ For linear models, StandardScaler is preferred due to better optimization behavior, even for encoded ordinal features.
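The toy numbers above can be run directly to see the two behaviors side by side; this small check uses the Product_Weight values from the example.

```python
# MinMaxScaler preserves relative spacing in [0, 1];
# StandardScaler re-centers to mean 0 and re-spreads to unit variance.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

weight = np.array([[8.0], [12.0], [16.0]])  # Product_Weight (kg), toy values
mm = MinMaxScaler().fit_transform(weight).ravel()
ss = StandardScaler().fit_transform(weight).ravel()
print("MinMax:  ", mm)   # [0.  0.5 1. ] - order and spacing preserved
print("Standard:", ss)   # mean 0, symmetric around the middle value
```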

NOTE:

If the values imply order, scale them.

If the values are just binary flags, don’t scale.

GOOD Practices (TIPS)

  • Keeps the pipeline uniform and consistent

  • Prevents this feature from being treated on a different scale than the rest

  • Avoids manual branching (i.e., "scale this numeric but not that one")

Linear Regression: Transformation vs Scaling - Rule of Thumb¶

  • Transformation = Changes distribution shape (log, sqrt) → Fix skewness/non-linearity
  • Scaling = Changes magnitude, preserves shape → Put features on equal footing

Scaling in Linear Regression Why Almost Always Needed:

  • Undue advantage problem: Large-scale features (e.g., area=2000) dominate small-scale features (e.g., rooms=3)
  • Algorithm bias: Loss function responds more to large-magnitude changes
  • Coefficient quality: Large-scale features get "granular optimization", small-scale features get "lazy optimization"

Gradient Descent Behavior:

  • Large-scale feature coefficients → Get refined through many iterations ✅
  • Small-scale feature coefficients → Barely updated, stuck at suboptimal values ❌

When to Scale/Transform

  • Basic OLS: Scaling recommended (coefficient interpretability)
  • Regularized LR: Scaling mandatory (Ridge/Lasso penalties)
  • Transformation: Statistical decision based on distribution analysis

Memory Hook "Features with larger raw values get artificial loudness in the loss function - scaling makes everyone speak at the same volume."

Imbalance for Regression¶

📌 NOTE:

Classification algorithms:

  • Can get biased toward majority class
  • Need sampling techniques (SMOTE, etc.)

Linear Regression:

  • Learns coefficients based on available data
  • No inherent bias toward frequent categories

In addition 👀

  • "Imbalance treatment" (like resampling or class weighting) is typically required for target variable imbalance in classification tasks — not for feature columns.

  • For categorical feature columns, it's natural for some categories to dominate. That's simply part of the data distribution.

Encoding & High Cardinality¶

📌 Point in Question:

“Due to high cardinality, encoding options like target encoding or learned embeddings may be more effective than simple one-hot encoding.”

Generally

  • ✅ One-Hot Encoding (OHE) → Preferred for Linear Models (like Linear Regression), since they assume linearity and independence between features.

  • ✅ Target Encoding → Works well with Tree-based models (like Random Forest, XGBoost), since they can handle ordinal relationships and benefit from the signal in the target.

💡 OHE can hurt tree models by increasing dimensionality unnecessarily, while target encoding may leak info in linear models.
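The two options can be contrasted on a toy column (illustrative values, not SuperKart data). Note this uses a naive in-sample mean for target encoding; real target encoding needs out-of-fold means to avoid the leakage mentioned above.

```python
# One-hot via pandas vs a naive target (mean) encoding via groupby.
import pandas as pd

toy = pd.DataFrame({"Store_Type": ["A", "B", "A", "C"],
                    "Sales":      [100, 200, 120, 300]})

ohe = pd.get_dummies(toy["Store_Type"], prefix="Store_Type")  # 3 binary columns
target_means = toy.groupby("Store_Type")["Sales"].mean()
toy["Store_Type_te"] = toy["Store_Type"].map(target_means)    # 1 numeric column
print(ohe.shape, toy["Store_Type_te"].tolist())
```

OHE grows one column per category (the dimensionality cost for trees); target encoding stays at one column regardless of cardinality.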


NOTE: Use below cell if want detail report for all columns¶

In [239]:
# from ydata_profiling import ProfileReport

# # Quick outlier summary
# profile = ProfileReport(df, title="Outlier & Data Profile", explorative=True)
# profile.to_notebook_iframe()  # or profile.to_file("report.html")

Outlier Detection¶

🤔 Why IQR over Z-score:

  • More robust: Works with skewed data (like our Product_Allocated_Area)
  • No distribution assumptions: Doesn't require normal distribution
  • Industry standard: More commonly used in practice
In [240]:
def detect_outliers_iqr(df):
    """
    Detect outliers using IQR method (consistent with our EDA approach)
    """
    numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

    outlier_summary = []

    for col in numerical_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        outlier_count = len(outliers)
        outlier_percentage = (outlier_count / len(df)) * 100

        outlier_summary.append({
            'Column': col,
            'Outlier_Count': outlier_count,
            'Outlier_Percentage': round(outlier_percentage, 2)
        })

    return pd.DataFrame(outlier_summary)
In [241]:
# Quick outlier check
outlier_results = detect_outliers_iqr(df)
print(outlier_results)
                      Column  Outlier_Count  Outlier_Percentage
0             Product_Weight             54                0.62
1     Product_Allocated_Area            104                1.19
2                Product_MRP             57                0.65
3   Store_Establishment_Year              0                0.00
4  Product_Store_Sales_Total            119                1.36

📌 Rule of thumb: If outliers < 5% of data, skewness between -1 and 1, and reasonable scale → no target preprocessing needed!

🔍 Observations

  • ~1% outliers is typically acceptable:
    • Most real-world datasets have 1-5% outliers
    • The IQR method can be conservative (finds more outliers than Z-score)
    • ~1% suggests data quality is quite good

📊Key Observations:

  • Minimal outlier presence: All features show <1.5% outliers, indicating high data quality across SuperKart's dataset

  • Target variable outliers: 119 outliers (1.36%) in sales data likely represent legitimate high-performing product-store combinations, not data errors

  • Store_Establishment_Year clean: Zero outliers confirm only 4 establishment years with consistent data entry

  • Treatment recommendation: Keep outliers as they represent valid business scenarios (premium products, exceptional performance) rather than data quality issues

  • Preprocessing decision: No outlier treatment needed - low percentages suggest authentic extreme values rather than measurement errors

No outlier treatment is needed for the target variable (Product_Store_Sales_Total) because:

  • 📉 Altering target values can distort the true signal your model is trying to learn.
  • 🧠 Models, especially tree-based ones (like CatBoost), are robust to such minor outliers.
  • 📊 The percentage of outliers is very low (~1.36%), indicating minimal risk of skewing model performance.

🔧 Preprocessing Plan of Action¶

We will build and evaluate two models:

  1. Linear Model – which requires numerical stability and feature scaling.
  2. Tree-Based Model (CatBoost) – which handles categorical features natively and is robust to scaling.

In this phase, we will:

  • Identify and define appropriate preprocessing steps for each model type.
  • Implement any necessary custom transformers (e.g., feature engineering).
  • Construct a preprocessing pipeline using ColumnTransformer.

The final model pipelines (preprocessing + estimator) will be built during the modeling phase.

🧠 REMEMBER

🔁 Typical pipeline order:

  1. Impute missing values
  2. Encode categorical features
  3. Transform (e.g., log, binning, etc.)
  4. Scale numerical features
  5. Model
  • All feature transformations (like encoding, imputation, log transforms, etc.) are ideally done before scaling.

Logical rationale:

  1. Missing values first - Need complete data before other operations
  2. Encode categoricals - Convert to numeric format
  3. Transform - Change distributions (log, etc.) on original scale
  4. Scale - Standardize the final transformed features
  5. Model - Use processed data

Key insight: Transform before scaling because you want to scale the final transformed distribution, not transform already-scaled values.

Also for Target Column:

  • showed good properties in EDA - near-normal distribution (mean ≈ median)
  • Transform target only if residual analysis (linear model case) show problems (check later)

Prep - Steps Setup (Preprocessing)¶

REMEMBER 💡¶

Interpretation Guideline (general - numeric) 🚀

  • Scaling needed? → Yes, for numeric features with a large range or skew
  • Log transform? → If skew > 1 or < -1
  • Outliers? → Treat if outliers > 2–5%
  • Encoding? → Low-cardinality categorical → OneHot; ordinal → OrdinalEncoder
  1. Is it skewed? (|skew| > 1)

    • YES → Transform first (log, box-cox)
    • NO → No transform needed
  2. Are there outliers? (>5% extreme values)

    • YES → Use RobustScaler
    • NO → Use StandardScaler
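The outlier branch of the decision above can be expressed as a small helper (a sketch; the 5% threshold and the IQR rule follow the guideline in this section):

```python
# Pick RobustScaler when the IQR rule flags more than ~5% of points,
# otherwise StandardScaler.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

def choose_scaler(series, outlier_threshold=0.05):
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return RobustScaler() if mask.mean() > outlier_threshold else StandardScaler()

rng = np.random.default_rng(0)
clean = rng.normal(100, 10, 1000)            # well-behaved feature
print(type(choose_scaler(clean)).__name__)   # StandardScaler (few extremes)
```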

🧠 REMEMBER

Encoded Categorical Features:

  • Don't treat as true numericals - they're enumerated values
  • Skip outlier/skewness checks for one-hot encoding (0/1 values)
  • Maybe light preprocessing for ordinal/target encoding, but not full numerical treatment

True Numerical Features:

  • Full preprocessing needed - outliers, skewness, scaling, transformations
  • These actually represent continuous quantities

Rule of Thumb:

One-hot encoded → Leave alone (just 0s and 1s)
Ordinal encoded → Maybe scale, but don't overthink it  
Target encoded → Treat more like numerical
True numerical → Full preprocessing pipeline

Key insight: Don't lose sight of the original data nature just because encoding made it numeric!

Helper Snippets¶

In [242]:
from scipy.stats import skew, zscore  # required by the helpers below

def summarize_numeric_column(df, col):
    series = df[col].dropna()
    skewness = skew(series)
    outliers = (abs(zscore(series)) > 3).sum()
    outlier_pct = (outliers / len(series)) * 100

    print(f"\n📈 Column: {col}")
    print(f"Skewness: {skewness:.2f}")
    print(f"Outlier Count (Z > 3): {outliers} ({outlier_pct:.2f}%)")
    print(f"Min: {series.min()}, Max: {series.max()}")
    print(f"Mean: {series.mean():.2f}, Std: {series.std():.2f}")
In [243]:
def summarize_categorical_column(df, col):
    counts = df[col].value_counts()
    percentages = df[col].value_counts(normalize=True) * 100
    summary = pd.DataFrame({
        'Count': counts,
        'Percentage (%)': percentages.round(2)
    })
    imbalance_ratio = counts.max() / counts.min() if counts.min() > 0 else float('inf')

    print(f"\n🪪 Column: {col}")
    print(f"Unique Categories: {df[col].nunique()}")
    print(f"Imbalance Ratio (max/min): {imbalance_ratio:.2f}")
    print(summary)

🧠 REMEMBER 💡¶

Interpretation Guideline

  • Skewness > ±1 → Consider log or power transformation
  • Outlier % > 2–5% → Consider treatment (cap, impute, remove)
  • Imbalance Ratio > 4 → May affect encoding; be cautious
  • Unique Categories = 2–5 → OneHot/Ordinal (case-by-case)
  • Unique Categories > 10 → High cardinality; use frequency/target encoding if needed (tree models only)

💡 Z-Score Interpretation

  • Any data point with |Z| > 3 is in the extreme 0.3%, i.e., very far from the mean.
  • Hence, such points are considered statistical outliers.

NOTE:

As a general guideline:

  • ✅ |Skewness| ≤ 0.5 → Data is fairly symmetric → Z-score is reliable
  • ⚠️ 0.5 < |Skewness| < 1 → Moderate skew → Z-score can be used with caution
  • ❌ |Skewness| ≥ 1 → Highly skewed → Avoid Z-score, prefer IQR method

So, Z-score is best used when skewness is close to 0 (i.e., data is approximately normal).

Z-Score assumes a normal distribution. For skewed data, use the IQR method instead.
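A quick illustration of this note on synthetic right-skewed data (lognormal, chosen for illustration): the Z-score, which assumes normality, flags fewer points than the IQR rule.

```python
# Compare Z-score vs IQR outlier counts on a right-skewed sample.
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=0, sigma=1, size=5000)

z_flags = int((np.abs(zscore(skewed)) > 3).sum())
q1, q3 = np.percentile(skewed, [25, 75])
iqr = q3 - q1
iqr_flags = int(((skewed < q1 - 1.5 * iqr) | (skewed > q3 + 1.5 * iqr)).sum())
print(f"Z-score flags: {z_flags}, IQR flags: {iqr_flags}")
```

The long right tail inflates the standard deviation, so the |Z| > 3 cutoff drifts outward and misses points the IQR rule catches.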

Evaluating Columns For Custom Transformer¶

In [244]:
summarize_numeric_column(df, 'Product_Weight')
📈 Column: Product_Weight
Skewness: 0.02
Outlier Count (Z > 3): 24 (0.27%)
Min: 4.0, Max: 22.0
Mean: 12.65, Std: 2.22
In [245]:
summarize_categorical_column(df, 'Product_Sugar_Content')
🪪 Column: Product_Sugar_Content
Unique Categories: 4
Imbalance Ratio (max/min): 45.23
                       Count  Percentage (%)
Product_Sugar_Content                       
Low Sugar               4885           55.75
Regular                 2251           25.69
No Sugar                1519           17.33
reg                      108            1.23
  • Low cardinality
  • No imbalance treatment needed (the skew appears natural)
  • However, reg is most likely a mislabel of Regular, so it should be folded into that category
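The cleanup implied above is a one-line mapping; sketched here on a toy series (the real version would run on df['Product_Sugar_Content']):

```python
# Fold the 'reg' mislabel into 'Regular'.
import pandas as pd

s = pd.Series(["Low Sugar", "reg", "Regular", "No Sugar", "reg"])
s_clean = s.replace({"reg": "Regular"})
print(s_clean.value_counts())
```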
In [246]:
summarize_numeric_column(df, 'Product_Allocated_Area')
📈 Column: Product_Allocated_Area
Skewness: 1.13
Outlier Count (Z > 3): 75 (0.86%)
Min: 0.004, Max: 0.298
Mean: 0.07, Std: 0.05
  • Mild outliers (~1% by either method) - no treatment required
  • Actions
    • Apply a power transformation to reduce the skew (1.13)

PowerTransformer standardizes the data by default (standardize=True), so no separate scaling step is needed for this feature.
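This default is easy to verify on a synthetic right-skewed sample (lognormal, standing in for Product_Allocated_Area):

```python
# PowerTransformer(standardize=True) is the default, so the output is
# already ~zero-mean / unit-variance - no extra scaler needed.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(7)
x = rng.lognormal(size=(1000, 1))  # right-skewed toy feature
x_t = PowerTransformer(method="yeo-johnson").fit_transform(x)
print(round(float(x_t.mean()), 3), round(float(x_t.std()), 3))  # mean ≈ 0, std ≈ 1
```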

In [247]:
summarize_categorical_column(df, 'Product_Type')
🪪 Column: Product_Type
Unique Categories: 16
Imbalance Ratio (max/min): 16.43
                       Count  Percentage (%)
Product_Type                                
Fruits and Vegetables   1249           14.25
Snack Foods             1149           13.11
Frozen Foods             811            9.25
Dairy                    796            9.08
Household                740            8.44
Baking Goods             716            8.17
Canned                   677            7.73
Health and Hygiene       628            7.17
Meat                     618            7.05
Soft Drinks              519            5.92
Breads                   200            2.28
Hard Drinks              186            2.12
Others                   151            1.72
Starchy Foods            141            1.61
Breakfast                106            1.21
Seafood                   76            0.87
In [248]:
df.groupby("Product_Type")["Product_Store_Sales_Total"].mean()
Out[248]:
Product_Type
Baking Goods             3425.958101
Breads                   3574.711200
Breakfast                3416.324623
Canned                   3397.463383
Dairy                    3532.560352
Frozen Foods             3464.834562
Fruits and Vegetables    3443.421353
Hard Drinks              3364.594731
Health and Hygiene       3445.393646
Household                3465.865095
Meat                     3445.326764
Others                   3586.068212
Seafood                  3584.263684
Snack Foods              3471.711880
Soft Drinks              3462.513911
Starchy Foods            3679.251418
Name: Product_Store_Sales_Total, dtype: float64

Observation 👀

  • Minimum avg sales: ~3364 (Hard Drinks)
  • Maximum avg sales: ~3679 (Starchy Foods)
  • Overall range: ≈ 315
  • This is less than a 10% spread in average sales between the best and worst categories, which means:
  • ✅ Sales performance is fairly uniform across product types
    • Standalone predictive power: Almost zero
    • Feature Interactions (main reason) - might help in prediction

Reckoning and Insights 🧐

  • Some rare categories (Starchy Foods: 3679, Seafood: 3584) are TOP performers
  • Narrow range: Only 9% difference between best/worst (3679 vs 3365)
  • Bottom 6 categories have <200 samples each

Grouping Challenges:

  • Can't group by sample size alone: Would lose high-performing signals (Starchy Foods, Seafood)

Statistical reality check:

If we consider domain grouping as

  1. Beverages: Hard Drinks + Soft Drinks
  2. Carbs: Breads + Starchy Foods + Breakfast
  3. Protein: Meat + Seafood
  • Rare categories spread: 3679 (Starchy) to 3365 (Hard Drinks) = 314 difference
  • Same 9% problem: Domain grouping still combines categories with significant differences
  • No real improvement: Beverages (3463+3365) vs Carbs (3575+3679+3416) still mixes high/low performers
  • Domain grouping doesn't solve the core issue - we're still mixing categories with 200-300 point differences.

Hence for now,

  • Merge all <200 sample categories into "Others"
  • Accept the 5% performance flattening
  • Focus on features with stronger signals (MRP, Store_Size, etc.)

🥇 Most Viable: Mapping to Others

  • Hence, club the rare types { Breads, Hard Drinks, Starchy Foods, Breakfast, Seafood } into Others
  • Simple and robust
  • Keep dimensionality lower

NOTE:

However For More Nuance we can try below as well (ie introduce 2 new groups):

  • Rare_High_Sales: Seafood, Starchy Foods, Breads
  • Rare_Low_Sales: Hard Drinks, Breakfast

If the model underperforms later, this grouping can be revisited for a better split.

In [249]:
# Check sales similarity for categories being merged into "Others"
def analyze_rare_category_grouping(df):
    """
    Analyze the impact of merging rare categories into 'Others'
    """

    # Define categories to merge
    rare_categories = ['Breads', 'Hard Drinks', 'Starchy Foods', 'Breakfast', 'Seafood']

    # Current "Others" performance
    others_current = df[df['Product_Type'] == 'Others']['Product_Store_Sales_Total']
    print("📊 CURRENT 'OTHERS' CATEGORY")
    print(f"Count: {len(others_current)}")
    print(f"Average Sales: {others_current.mean():.0f}")

    # Rare categories individual performance
    print("\n📋 RARE CATEGORIES TO MERGE")
    rare_stats = []
    total_count = 0
    total_sales = 0

    for category in rare_categories:
        cat_data = df[df['Product_Type'] == category]['Product_Store_Sales_Total']
        avg_sales = cat_data.mean()
        count = len(cat_data)
        rare_stats.append((category, count, avg_sales))
        total_count += count
        total_sales += cat_data.sum()
        print(f"{category}: {count} samples, avg = {avg_sales:.0f}")

    # Combined "Others" after merging
    print("\n🔄 AFTER MERGING")
    new_others_count = len(others_current) + total_count
    new_others_avg = (others_current.sum() + total_sales) / new_others_count

    print(f"New 'Others' Count: {new_others_count}")
    print(f"New 'Others' Average: {new_others_avg:.0f}")

    # Impact analysis
    print("\n📈 IMPACT ANALYSIS")
    print(f"Change in 'Others' average: {new_others_avg - others_current.mean():.0f}")
    print(f"Range of rare categories: {min([x[2] for x in rare_stats]):.0f} - {max([x[2] for x in rare_stats]):.0f}")

    return new_others_avg, rare_stats
In [250]:
# Run analysis
new_avg, rare_stats = analyze_rare_category_grouping(df)
📊 CURRENT 'OTHERS' CATEGORY
Count: 151
Average Sales: 3586

📋 RARE CATEGORIES TO MERGE
Breads: 200 samples, avg = 3575
Hard Drinks: 186 samples, avg = 3365
Starchy Foods: 141 samples, avg = 3679
Breakfast: 106 samples, avg = 3416
Seafood: 76 samples, avg = 3584

🔄 AFTER MERGING
New 'Others' Count: 860
New 'Others' Average: 3530

📈 IMPACT ANALYSIS
Change in 'Others' average: -56
Range of rare categories: 3365 - 3679

👀 Merging creates a robust, well-balanced category without extreme distortion

Hence, introduce a new preprocessing step (e.g., group_rare_labels) to apply this mapping
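One way to realize the group_rare_labels idea above as a pipeline-compatible step (the name and the toy frame are illustrative):

```python
# Map the five rare Product_Type values into 'Others' via a FunctionTransformer.
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

RARE_TYPES = {"Breads", "Hard Drinks", "Starchy Foods", "Breakfast", "Seafood"}

def group_rare_labels(X):
    X = X.copy()
    X["Product_Type"] = X["Product_Type"].where(
        ~X["Product_Type"].isin(RARE_TYPES), "Others"
    )
    return X

rare_grouper = FunctionTransformer(group_rare_labels)

toy = pd.DataFrame({"Product_Type": ["Dairy", "Seafood", "Breads"]})
print(rare_grouper.fit_transform(toy)["Product_Type"].tolist())
```

Wrapping the mapping in a FunctionTransformer lets it slot into the ColumnTransformer planned for the modeling phase.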

In [251]:
summarize_numeric_column(df, 'Product_MRP')
📈 Column: Product_MRP
Skewness: 0.04
Outlier Count (Z > 3): 25 (0.29%)
Min: 31.0, Max: 266.0
Mean: 147.03, Std: 30.69

Decision: Use raw values + scaling only.

In [252]:
summarize_categorical_column(df, 'Store_Establishment_Year')
🪪 Column: Store_Establishment_Year
Unique Categories: 4
Imbalance Ratio (max/min): 4.06
                          Count  Percentage (%)
Store_Establishment_Year                       
2009                       4676           53.36
1987                       1586           18.10
1999                       1349           15.39
1998                       1152           13.15

🔍 Observations

  • Drop this column and derive store age from it instead (apply universally)
  • MinMaxScaler is better here

With 4 discrete values:

  • StandardScaler assumes normality - violated with discrete data
  • MinMaxScaler: No distributional assumptions, just linear scaling [0,1]
  • Clean interpretation: Older stores → higher values, newer stores → lower values

✅ Best Choice for Store_Age: MinMaxScaler

Why?

  • Works better when few discrete values are involved
  • Keeps the values in the same order and spacing, but normalized
  • Tree models don’t need it, but for Linear models, it avoids distortion that StandardScaler could introduce with such low cardinality
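The derivation above can be sketched on the four observed years (the reference year 2025 is an assumption; any fixed "current" year works):

```python
# Derive Store_Age from the establishment year, then MinMax-scale to [0, 1].
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

REFERENCE_YEAR = 2025  # assumed current year

years = pd.DataFrame({"Store_Establishment_Year": [1987, 1998, 1999, 2009]})
age = REFERENCE_YEAR - years[["Store_Establishment_Year"]]
age_scaled = MinMaxScaler().fit_transform(age)
print(age_scaled.ravel())  # oldest store -> 1.0, newest -> 0.0
```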
In [253]:
summarize_categorical_column(df, 'Store_Size')
🪪 Column: Store_Size
Unique Categories: 3
Imbalance Ratio (max/min): 5.23
            Count  Percentage (%)
Store_Size                       
Medium       6025           68.75
High         1586           18.10
Small        1152           13.15
  • Use ordinal encoding to preserve the size progression (Small < Medium < High)
  • Then apply MinMax scaling (if needed) to consolidate the range
    • Though it is hardly needed here, as there are only 3 unique values
In [254]:
summarize_categorical_column(df, 'Store_Location_City_Type')
🪪 Column: Store_Location_City_Type
Unique Categories: 3
Imbalance Ratio (max/min): 5.44
                          Count  Percentage (%)
Store_Location_City_Type                       
Tier 2                     6262           71.46
Tier 1                     1349           15.39
Tier 3                     1152           13.15

NOTE: ✅

Even though Store_Location_City_Type is categorical in origin, here's why scaling is needed in the context of a linear model:


⚠️ Reason: Effect on Linear Model Coefficients

  • Once you apply ordinal encoding (Tier 1 → 1, Tier 2 → 2, Tier 3 → 3), it becomes a numeric feature — the model will treat it like a real-valued input.

  • Linear models interpret magnitude and scale literally - a larger value means more influence unless scaled.


🧠 Key Points:

  • Ordinal encoding preserves order, but not relative spacing.
  • Tier 3 (3) vs Tier 1 (1) has 2x the weight in raw encoded form - linear models treat this as meaningful unless corrected by scaling.
  • StandardScaler standardizes the spread, so the model treats this more fairly alongside other features.

So, despite the feature’s categorical origin, scaling is needed post-encoding for:

  • Coefficient stability
  • Convergence in gradient descent
  • Balanced influence across features

This is especially important when using regularization (like in Ridge or Lasso).

💡 Scaling reduces feature dominance and improves convergence.
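This encode-then-scale idea can be sketched as follows (the rows are made up, and the explicit category order is an assumption to pin Tier 1 < Tier 2 < Tier 3):

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

tiers = pd.DataFrame({"Store_Location_City_Type": ["Tier 2", "Tier 1", "Tier 3", "Tier 2"]})

enc_scale = make_pipeline(
    OrdinalEncoder(categories=[["Tier 1", "Tier 2", "Tier 3"]]),  # Tier 1 -> 0, ..., Tier 3 -> 2
    StandardScaler(),  # standardize the codes so their magnitude doesn't dominate coefficients
)
encoded = enc_scale.fit_transform(tiers)
print(encoded.ravel())  # zero-mean, unit-variance version of the ordinal codes
```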

In [255]:
summarize_categorical_column(df, 'Store_Type')
🪪 Column: Store_Type
Unique Categories: 4
Imbalance Ratio (max/min): 4.06
                    Count  Percentage (%)
Store_Type                               
Supermarket Type2    4676           53.36
Supermarket Type1    1586           18.10
Departmental Store   1349           15.39
Food Mart            1152           13.15
  • Use OneHotEncoder (Nominal Categories)

Code for Setup¶

🧠 REMEMBER¶

1. Custom Transformers + Serialization: Key Points¶

❗ Problem:

Saving a pipeline with a custom transformer causes error when loading:

refer: https://stackoverflow.com/questions/49621169/joblib-load-main-attributeerror

AttributeError: Can't get attribute 'MyTransformer' on <module '__main__'>

✅ Why:

  • The transformer was defined in a notebook/script, so its class is registered under __main__
  • pickle/joblib can't find the class when reloading (it serializes the class by module name and looks that module up on load)

✅ Best Practice:

  1. Put the custom class in a separate file (ie a separate module) (eg my_transformers.py)
  2. Use it in your training code (eg notebook.py)
  3. Load safely in another script (eg backend.py)

🔁 TL;DR: Points to Remember

  • ✅ Define custom transformers in .py files, not inline
  • ✅ Import them before loading model
  • ❌ Avoid defining in __main__ or notebook

Instead of the hacky `__main__` reassignment, keeping the class in its own file with proper imports ensures consistent module paths during both serialization and deserialization.

Eg of the hacky workaround: `__main__.ClassName = ClassName  # make the right class reachable in the loading project (eg backend)`
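A self-contained sketch of the best practice (module and class names here are made up): once the class lives in its own module, pickle records the stable path my_transformers.AddOne, which any loading script can resolve by importing that module.

```python
import importlib.util
import pathlib
import pickle
import sys

# 1. Write the transformer class to its own module file
pathlib.Path("my_transformers.py").write_text(
    "class AddOne:\n"
    "    def transform(self, xs):\n"
    "        return [x + 1 for x in xs]\n"
)

# 2. Import it as a proper module, so the class's __module__ is 'my_transformers'
spec = importlib.util.spec_from_file_location("my_transformers", "my_transformers.py")
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
sys.modules["my_transformers"] = mod  # make the module resolvable for (un)pickling

# 3. Round-trip works: pickle looks up 'my_transformers.AddOne', not '__main__.AddOne'
payload = pickle.dumps(mod.AddOne())
restored = pickle.loads(payload)
print(restored.transform([1, 2, 3]))  # [2, 3, 4]
```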

2. Column Order Matters For Model¶

⚠️ The column ordering issue with CatBoost.

While pandas DataFrames are flexible with column access by name, the underlying CatBoost model expects features in the exact same order as it was trained with. This is a common gotcha when working with many ML models, not just CatBoost, since they ultimately work with numpy arrays where position matters more than labels.

📝 Lesson: Column Order Matters in ML Pipelines (Even with Column Names!)

🔍 Root Cause

  • trained pipeline is expecting input columns in the exact same order (and names) as during training.
  • if your test DataFrame has columns in a different order than the training data, it sends the wrong data to the wrong features, causing wrong predictions or even errors.

✨ Tips

  • Always unify the column order in the first step (ie Data Cleaning), before preprocessing actually begins!
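That unification boils down to one reindex-by-name line (the column list here is a shortened illustration):

```python
import pandas as pd

# Abbreviated stand-in for the full expected column list
EXPECTED_X_COLUMNS = ["Product_Weight", "Product_MRP"]

# Incoming frame with columns in a different order than during training
incoming = pd.DataFrame({"Product_MRP": [147.0], "Product_Weight": [12.5]})

# Selecting by name restores the positional order the model was trained with
aligned = incoming[EXPECTED_X_COLUMNS]
print(list(aligned.columns))  # ['Product_Weight', 'Product_MRP']
```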

3. 🧠 get_feature_names_out (Pipeline Custom Transformer)¶

General Rule

  • get_feature_names_out() should ONLY return feature column names
  • Target column (y) is NOT included in the returned names
  • This is because transformers operate on features (X) independently of the target

Reasoning

  • Transformers follow the X/y separation principle in scikit-learn
  • The target variable (y) is handled separately in the pipeline
  • Feature transformations should be independent of the target
  • This maintains the ability to use the transformer for prediction where y is unknown

Key Points

  • Only includes names of columns that come out of transform()
  • Even if target column is present in input DataFrame, ignore it in get_feature_names_out()

Pipeline Context

  • Transformers focus on X transformation
  • Target transformers (for y) are separate pipeline components

Remember: get_feature_names_out() is about feature tracking, not target tracking. The target variable is handled separately in the ML pipeline.

Flow (Pipeline)

Pipeline.fit(X, y)

For Transformers in pipeline:

  • y is passed to their fit() method (as optional parameter)
  • but their transform() only gets X
  • transformed X flows to next step
  • original y remains unchanged
pipeline.fit(X, y):
    Step 1 (Transformer): 
        - fit(X, y)      # y available but usually unused
        - transform(X)   # only X, produces X1

    Step 2 (Transformer):
        - fit(X1, y)     # y still original
        - transform(X1)  # only X1, produces X2

    Final Step (Model):
        - fit(X2, y)     # model gets transformed X2 and original y

Why this

  • Prevents accidental target leakage in transformations
  • Maintains clean separation between feature and target processing
  • Ensures y remains pristine for final model training

This is why transformers are designed to work primarily with X, even though y is technically available in their fit() method.
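The flow above can be verified with a toy probe transformer (class and data below are illustrative): y reaches fit(), but transform() only ever sees X.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class RecordYSeen(BaseEstimator, TransformerMixin):
    """Toy probe: records whether y reached fit(); transform() only sees X."""
    def fit(self, X, y=None):
        self.y_seen_in_fit_ = y is not None  # y is available here...
        return self

    def transform(self, X):
        return X  # ...but transform passes X through and never touches y

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

pipe = Pipeline([("probe", RecordYSeen()), ("model", LinearRegression())])
pipe.fit(X, y)
print(pipe.named_steps["probe"].y_seen_in_fit_)  # True
```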

🤖 Custom Transformers (Pipeline - Helpers/Utils)¶

In [256]:
# make dir for preprocessing helper functions
os.makedirs('preprocessing', exist_ok=True)

1. Data Prepper Helper file

In [257]:
%%writefile preprocessing/custom_data_prepper.py
from sklearn.base import BaseEstimator, TransformerMixin
from datetime import datetime
import numpy as np

class CustomDataPrepper(BaseEstimator, TransformerMixin):
    """
    Custom transformer for preparing data in the SuperKart model pipeline
    that performs:
    a. data cleaning
    b. data preparation (ie before feature engineering)
    ---
    For now
    - Product_Sugar_Content: 'reg' → 'Regular'
    - Store_Establishment_Year <-> Store_Age
    - Unify col order
    - Drop Nominal Columns
    ---
    NOTE:
    - This will work only on DataFrames (ie expects named columns to exist)
    """

    # Constants for column names
    STORE_AGE_COL = 'Store_Age'

    # Define expected column order (for Predictors)
    EXPECTED_X_COLUMNS = [
        'Product_Weight',
        'Product_Sugar_Content',
        'Product_Allocated_Area',
        'Product_Type',
        'Product_MRP',
        'Store_Size',
        'Store_Location_City_Type',
        'Store_Type',
        'Store_Age',  # transformed from Store_Establishment_Year
        # 'Product_Store_Sales_Total'  # target variable
    ]

    def __init__(self):
        """Initialize the data cleaner"""
        pass

    def fit(self, X, y=None):
        """
        Fit method (required for sklearn compatibility)
        No fitting needed for data cleaning operations
        """
        return self

    def transform(self, X):
        """
        Apply data cleaning transformations

        Args:
            X: DataFrame to clean

        Returns:
            DataFrame with cleaned data
        """
        # Create a copy to avoid modifying original data
        X_cleaned = X.copy()

        # 1. Fix Product_Sugar_Content inconsistency (reg → Regular)
        if 'Product_Sugar_Content' in X_cleaned.columns:
            X_cleaned['Product_Sugar_Content'] = X_cleaned['Product_Sugar_Content'].replace(
                'reg', 'Regular'
            )

        # 2. Store_Establishment_Year -> Store_Age
        if 'Store_Establishment_Year' in X_cleaned.columns:
            current_year = datetime.now().year
            X_cleaned[self.STORE_AGE_COL] = np.maximum(current_year - X_cleaned['Store_Establishment_Year'], 0)
            # drop original column
            X_cleaned = X_cleaned.drop('Store_Establishment_Year', axis=1)

        # 3. Drop Nominal Columns
        X_cleaned = self.drop_columns(X_cleaned)

        # 4. Unify column order (MIMP <--)
        # ---
        # !!
        # Column order is critical for CatBoostRegressor as it relies on column positions
        # If column order differs between training and inference, wrong features will be
        # used leading to incorrect predictions
        # ---
        #available_expected_cols = [col for col in self.EXPECTED_COLUMNS if col in X_cleaned.columns]
        X_cleaned = X_cleaned[self.EXPECTED_X_COLUMNS]

        # Add other data cleaning rules here as needed
        # Example: Fix whitespace, case issues, etc.

        return X_cleaned

    def get_feature_names_out(self, input_features=None):
        """
        Return feature names for output features.
        Required for sklearn compatibility.

        IMPORTANT:
        - This transformer STRICTLY expects a DataFrame with specific columns
        - We intentionally return EXPECTED_X_COLUMNS as transform() ensures this structure
        - If input structure doesn't match, transform() will fail earlier with KeyError
        - This strict behavior is INTENDED to catch data structure mismatches early

        Returns:
            list: EXPECTED_X_COLUMNS - the fixed ordered list of feature names
                that transform() guarantees to output
        """
        # if input_features is None:
        #     return input_features

        # # Create list of output features
        # output_features = list(input_features)

        # # Handle Store_Establishment_Year -> Store_Age transformation
        # if 'Store_Establishment_Year' in output_features:
        #     output_features.remove('Store_Establishment_Year')
        #     output_features.append(self.STORE_AGE_COL)

        # return output_features

        # Now we can return EXPECTED_X_COLUMNS, as it is ensured in the last step of transform()
        return self.EXPECTED_X_COLUMNS


    def drop_columns(self, X):
        cols_to_drop = ['Product_Id', 'Store_Id']
        existing_cols = [col for col in cols_to_drop if col in X.columns]
        return X.drop(columns=existing_cols)
Overwriting preprocessing/custom_data_prepper.py
In [258]:
# !! This Cell was used just for testing !!

# # As earlier we wrote file to directory we can now import and use it here
# from preprocessing.custom_data_prepper import CustomDataPrepper

2. Debug Transformer Helper

In [259]:
%%writefile preprocessing/debug_transformer.py
from sklearn.base import BaseEstimator, TransformerMixin

'''
NOTE:

ENABLE_DEBUG: class-level flag that turns the debugging feature on or off globally
active: per-instance flag (used especially to retain the config across serialization)
'''

class DebugTransformer(BaseEstimator, TransformerMixin):
    # This is global config
    ENABLE_DEBUG = False

    # This is a custom transformer to debug the data after preprocessing
    # active: per-instance flag (used especially to retain the config across serialization)
    def __init__(
        self,
        message="Debug Step",
        show_head=False,
        n_rows=2,
        active=True,
    ):
        self.message = message
        self.show_head = show_head
        self.n_rows = n_rows
        self.active = active

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not self.ENABLE_DEBUG: # Global config
            return X
        if not self.active: # Individual instance/local config
            return X

        print(f"\n🪵 {self.message} ---- (D)")

        if hasattr(X, "columns"):
            # It's a Dataframe
            print(f"Shape: {X.shape}")
            print(f"Columns: {list(X.columns)}")
            if self.show_head:
                print(f"First {self.n_rows} rows:")
                print(X.head(self.n_rows))
        else:
            # It's a numpy array
            print("Data type: numpy array")
            if self.show_head:
                print("Sample row:")
                print(X[: self.n_rows])  # First n_rows rows of the numpy array

        # print(f"Shape: {X.shape}")
        # print(f"Columns: {list(X.columns) if hasattr(X, 'columns') else 'No column names'}")
        # print(f"Data types: {X.dtypes.to_dict() if hasattr(X, 'dtypes') else 'Array data'}")
        # print("Sample data:")
        # print(X.head(3))
        return X

    def get_feature_names_out(self, input_features=None):
        return input_features

    def turn_on(self):
        '''
        Turn on the debug feature for this instance
        '''
        self.active = True

    def turn_off(self):
        '''
        Turn off the debug feature for this instance
        '''
        self.active = False
Overwriting preprocessing/debug_transformer.py

3. Index file for the preprocessing folder

In [260]:
%%writefile preprocessing/__init__.py
"""Preprocessing module for data preparation and debugging transformers."""

from .custom_data_prepper import CustomDataPrepper
from .debug_transformer import DebugTransformer

__all__ = [
    'CustomDataPrepper',
    'DebugTransformer',
]

# # Set default debug mode
# DebugTransformer.ENABLE_DEBUG = False
Overwriting preprocessing/__init__.py

✅ Now our preprocessing helpers are ready!

In [261]:
from preprocessing import DebugTransformer, CustomDataPrepper

⚡ NOTE: Remember to copy the same preprocessing folder to the backend so that the deserialized model can find the classes correctly

In [262]:
def test_custom_data_prepper():
    """
    Test the CustomDataPrepper with various scenarios including:
    1. Basic transformations (Sugar Content, Store Age)
    2. Column order enforcement
    3. ID columns dropping
    4. Error cases for missing required columns
    """
    # Create sample data with all expected columns
    test_data = pd.DataFrame({
        # Required columns as per EXPECTED_X_COLUMNS
        'Product_Weight': [100, 200, 150, 300, 250, 180],
        'Product_Sugar_Content': ['Regular', 'reg', 'Low Sugar', 'No Sugar', 'reg', 'Low Sugar'],
        'Product_Allocated_Area': [10, 15, 20, 25, 30, 35],
        'Product_Type': ['Dairy', 'Breads', 'Snack Foods', 'Seafood', 'Others', 'Breakfast'],
        'Product_MRP': [50, 30, 25, 45, 35, 40],
        'Store_Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large'],
        'Store_Location_City_Type': ['Tier 1', 'Tier 2', 'Tier 1', 'Tier 3', 'Tier 2', 'Tier 1'],
        'Store_Type': ['Grocery Store', 'Supermarket', 'Grocery Store', 'Supermarket', 'Grocery Store', 'Supermarket'],
        'Store_Establishment_Year': [1985, 1990, 1995, 2000, 2025, 1987],  # Will be transformed to Store_Age

        # Additional columns that should be dropped
        'Product_Id': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
        'Store_Id': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6'],
    })

    # Test the transformer
    prepper = CustomDataPrepper()
    prepper.fit(test_data)
    cleaned_data = prepper.transform(test_data)

    # 1. Test Sugar Content Standardization
    print("=== Product_Sugar_Content Cleaning ===")
    print("Before:")
    print(test_data['Product_Sugar_Content'].value_counts())
    print("\nAfter:")
    print(cleaned_data['Product_Sugar_Content'].value_counts())

    # 2. Test Store Age Transformation
    current_year = datetime.now().year
    print("\n=== Store_Establishment_Year → Store_Age ===")
    print("Before (Establishment Year):")
    print(test_data['Store_Establishment_Year'].describe())
    print("\nAfter (Store Age):")
    print(cleaned_data['Store_Age'].describe())

    # 3. Test Column Order
    print("\n=== Column Order Verification ===")
    print("Expected columns:", prepper.EXPECTED_X_COLUMNS)
    print("Actual columns:", cleaned_data.columns.tolist())
    assert list(cleaned_data.columns) == prepper.EXPECTED_X_COLUMNS, "Column order mismatch!"

    # 4. Test ID Columns Dropping
    print("\n=== ID Columns Dropping ===")
    print("Dropped columns:", set(test_data.columns) - set(cleaned_data.columns))
    assert 'Product_Id' not in cleaned_data.columns, "Product_Id should be dropped"
    assert 'Store_Id' not in cleaned_data.columns, "Store_Id should be dropped"

    # 5. Test get_feature_names_out
    print("\n=== Feature Names Output ===")
    output_features = prepper.get_feature_names_out(test_data.columns)
    print("get_feature_names_out result:", output_features)
    assert output_features == prepper.EXPECTED_X_COLUMNS, "Feature names mismatch!"

    # 6. Test Error Case - Missing Required Columns
    print("\n=== Testing Error Case ===")
    try:
        incomplete_data = test_data.drop(['Product_Weight', 'Store_Size'], axis=1)
        prepper.transform(incomplete_data)
        print("ERROR: Should have raised KeyError for missing columns!")
    except KeyError as e:
        print("Successfully caught error for missing required columns:", e)

    print("\nAll tests completed successfully!")
In [263]:
# Run the tests
test_custom_data_prepper()
=== Product_Sugar_Content Cleaning ===
Before:
Product_Sugar_Content
reg          2
Low Sugar    2
Regular      1
No Sugar     1
Name: count, dtype: int64

After:
Product_Sugar_Content
Regular      3
Low Sugar    2
No Sugar     1
Name: count, dtype: int64

=== Store_Establishment_Year → Store_Age ===
Before (Establishment Year):
count       6.000000
mean     1997.000000
std        14.764823
min      1985.000000
25%      1987.750000
50%      1992.500000
75%      1998.750000
max      2025.000000
Name: Store_Establishment_Year, dtype: float64

After (Store Age):
count     6.000000
mean     28.000000
std      14.764823
min       0.000000
25%      26.250000
50%      32.500000
75%      37.250000
max      40.000000
Name: Store_Age, dtype: float64

=== Column Order Verification ===
Expected columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']
Actual columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

=== ID Columns Dropping ===
Dropped columns: {'Store_Id', 'Product_Id', 'Store_Establishment_Year'}

=== Feature Names Output ===
get_feature_names_out result: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

=== Testing Error Case ===
Successfully caught error for missing required columns: "['Product_Weight', 'Store_Size'] not in index"

All tests completed successfully!

🧠 REMEMBER (reload imports)

When developing Python modules in Jupyter notebooks using %%writefile, changes to the files won't automatically reflect in your notebook due to Python's module caching.

Hence use

importlib.reload(mymodule)  # Refresh!

whenever you change the module code and need the notebook to pick it up.

In [264]:
# !! This cell was used to test and tackle sync issues caused by %%writefile and Python's module import caching

# # Either force reload the module:
# # Otherwise sometimes changes done via `%%writefile` wont sync
# import importlib
# import preprocessing.debug_transformer
# importlib.reload(preprocessing.debug_transformer)
# from preprocessing.debug_transformer import DebugTransformer
In [265]:
def test_debug_transformer():
    """
    Test the DebugTransformer with various scenarios:
    1. DataFrame handling
    2. NumPy array handling
    3. Debug control (global and instance level)
    4. Different configurations
    5. Feature names preservation
    """
    from preprocessing.debug_transformer import DebugTransformer

    # Helper function to capture print output
    def capture_output(func):
        # Redirect stdout to capture print statements
        old_stdout = sys.stdout
        output = StringIO()
        sys.stdout = output

        func()

        # Restore stdout
        sys.stdout = old_stdout
        return output.getvalue()

    # Test data
    df = pd.DataFrame({
        'numeric': [1, 2, 3, 4, 5],
        'categorical': ['A', 'B', 'C', 'D', 'E'],
        'float_col': [1.1, 2.2, 3.3, 4.4, 5.5]
    })
    np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

    print("=== Starting DebugTransformer Tests ===")

    # 1. Test DataFrame handling
    def test_dataframe():
        debug = DebugTransformer(
            message="Testing DataFrame",
            show_head=True,
            n_rows=2
        )
        DebugTransformer.ENABLE_DEBUG = True
        result = debug.transform(df)

        # Verify data unchanged
        assert result.equals(df), "Transform should not modify DataFrame"

        # Verify feature names preserved
        assert list(debug.get_feature_names_out(df.columns)) == list(df.columns)

    print("\nTesting DataFrame handling...")
    output = capture_output(test_dataframe)
    print("DataFrame test output:")
    print(output)

    # 2. Test NumPy array handling
    def test_numpy():
        debug = DebugTransformer(
            message="Testing NumPy Array",
            show_head=True,
            n_rows=2
        )
        DebugTransformer.ENABLE_DEBUG = True
        result = debug.transform(np_array)

        # Verify data unchanged
        assert np.array_equal(result, np_array), "Transform should not modify NumPy array"

    print("\nTesting NumPy array handling...")
    output = capture_output(test_numpy)
    print("NumPy test output:")
    print(output)

    # 3. Test Debug Control
    def test_debug_control():
        debug = DebugTransformer(message="Debug Control Test")

        # Test global control
        DebugTransformer.ENABLE_DEBUG = False
        output1 = capture_output(lambda: debug.transform(df))
        assert output1 == "", "Should not output when debug disabled"

        # Test instance control
        DebugTransformer.ENABLE_DEBUG = True
        debug.turn_off()
        output2 = capture_output(lambda: debug.transform(df))
        assert output2 == "", "Should not output when instance disabled"

    print("\nTesting debug control...")
    test_debug_control()
    print("Debug control tests passed!")

    # 4. Test Different Configurations
    def test_configurations():
        # Test with show_head=False
        debug1 = DebugTransformer(
            message="No Head Test",
            show_head=False
        )
        DebugTransformer.ENABLE_DEBUG = True
        output1 = capture_output(lambda: debug1.transform(df))
        assert "First" not in output1, "Should not show head when show_head=False"

        # Test with different n_rows
        debug2 = DebugTransformer(
            message="Custom Rows Test",
            show_head=True,
            n_rows=3
        )
        output2 = capture_output(lambda: debug2.transform(df))
        assert "First 3 rows" in output2, "Should show correct number of rows"

    print("\nTesting different configurations...")
    test_configurations()
    print("Configuration tests passed!")

    # Reset global debug state
    DebugTransformer.ENABLE_DEBUG = False

    print("\n=== All DebugTransformer Tests Completed Successfully! ===")
In [266]:
# Run the tests
test_debug_transformer()
=== Starting DebugTransformer Tests ===

Testing DataFrame handling...
DataFrame test output:

🪵 Testing DataFrame ---- (D)
Shape: (5, 3)
Columns: ['numeric', 'categorical', 'float_col']
First 2 rows:
   numeric categorical  float_col
0        1           A        1.1
1        2           B        2.2


Testing NumPy array handling...
NumPy test output:

🪵 Testing NumPy Array ---- (D)
Data type: numpy array
Sample row:
[[1 2 3]
 [4 5 6]]


Testing debug control...
Debug control tests passed!

Testing different configurations...
Configuration tests passed!

=== All DebugTransformer Tests Completed Successfully! ===

🐍 Other Helpers (Helpers/Utils)¶

In [267]:
class ModelResultTracker:
    """
    Track and compare results across multiple models
    """

    def __init__(self):
        self.results = []
        self.model_objects = {}

    def add_result(self, model_name, model_type, train_metrics, test_metrics, model_object=None):
        """
        Add model results to tracker

        Args:
            model_name: Name of the model
            model_type: Type (Linear/Tree)
            train_metrics: Training metrics dict
            test_metrics: Testing metrics dict
            model_object: Trained model object (optional)
        """
        result = {
            'Model_Name': model_name,
            'Model_Type': model_type,
            'Train_RMSE': train_metrics['RMSE'],
            'Train_MAE': train_metrics['MAE'],
            'Train_R2': train_metrics['R2'],
            'Train_Adj_R2': train_metrics['Adj_R2'],
            'Train_MAPE': train_metrics['MAPE'],
            'Test_RMSE': test_metrics['RMSE'],
            'Test_MAE': test_metrics['MAE'],
            'Test_R2': test_metrics['R2'],
            'Test_Adj_R2': test_metrics['Adj_R2'],
            'Test_MAPE': test_metrics['MAPE'],
            'Overfitting_Score': train_metrics['R2'] - test_metrics['R2']  # Simple overfitting check
        }

        self.results.append(result)

        if model_object:
            self.model_objects[model_name] = model_object

    def get_results_summary_vertically(self):
        """
        Get summary of all model results with:
        - Metrics as rows
        - Models as columns
        """
        if not self.results:
            print("No results to display!")
            return None

        # Create initial DataFrame
        df_results = pd.DataFrame(self.results)

        # Set Model_Name as index before transposing
        df_results = df_results.set_index('Model_Name')

        # Drop Model_Type as it will look odd in metrics
        df_results = df_results.drop('Model_Type', axis=1)

        # Transpose and sort index for better organization
        df_results = df_results.T

        # Optional: Sort index to group metrics together
        # Create a custom sorter for metrics
        metric_order = [
            'Train_R2', 'Test_R2',
            'Train_Adj_R2', 'Test_Adj_R2',
            'Train_RMSE', 'Test_RMSE',
            'Train_MAE', 'Test_MAE',
            'Train_MAPE', 'Test_MAPE',
            'Overfitting_Score'
        ]

        df_results = df_results.reindex(metric_order)

        return df_results

    def get_results_summary_horizontally(self):
        """Get summary of all model results"""
        if not self.results:
            print("No results to display!")
            return None

        df_results = pd.DataFrame(self.results)
        return df_results

    def plot_model_comparison(self, metric='Test_RMSE'):
        """
        Plot model comparison for specified metric
        """
        if not self.results:
            print("No results to plot!")
            return

        df_results = pd.DataFrame(self.results)

        plt.figure(figsize=(12, 6))

        # Bar plot for model comparison
        colors = ['skyblue' if 'Regression' in model_type else 'lightcoral'
                 for model_type in df_results['Model_Type']]

        bars = plt.bar(df_results['Model_Name'], df_results[metric], color=colors)
        plt.xlabel('Model')
        plt.ylabel(metric)
        plt.title(f'Model Comparison - {metric}')
        plt.xticks(rotation=45, ha='right')

        # Add value labels on bars
        for bar, value in zip(bars, df_results[metric]):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                    f'{value:.2f}', ha='center', va='bottom')

        # Add legend
        from matplotlib.patches import Patch
        legend_elements = [Patch(facecolor='skyblue', label='Linear Model'),
                          Patch(facecolor='lightcoral', label='Tree Model')]
        plt.legend(handles=legend_elements)

        plt.tight_layout()
        plt.show()

    def get_best_model(self, metric='Test_R2', higher_better=True):
        """
        Get best performing model based on specified metric
        """
        if not self.results:
            print("No results available!")
            return None

        df_results = pd.DataFrame(self.results)

        if higher_better:
            best_idx = df_results[metric].idxmax()
        else:
            best_idx = df_results[metric].idxmin()

        best_model_name = df_results.loc[best_idx, 'Model_Name']
        best_score = df_results.loc[best_idx, metric]

        print(f"🏆 BEST MODEL: {best_model_name}")
        print(f"📊 {metric}: {best_score:.4f}")

        return best_model_name, best_score
In [268]:
def consolidate_categories(X):
    """Function to group rare labels in Categorical Column
    Currently:
    - Product_Type

    NOTE: Since this does not need fit() (the logic is static), it is used as a plain function directly
    """
    X_copy = X.copy()
    rare_categories = ['Breads', 'Hard Drinks', 'Starchy Foods', 'Breakfast', 'Seafood']
    X_copy['Product_Type'] = X_copy['Product_Type'].replace(rare_categories, 'Others')
    return X_copy

# # In linear model pipeline only:
# FunctionTransformer(consolidate_categories)
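A quick check of the helper when wrapped for a pipeline (toy frame; fit_transform is used since the logic is stateless):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def consolidate_categories(X):
    """Group rare Product_Type labels into 'Others' (same logic as above)."""
    X_copy = X.copy()
    rare_categories = ['Breads', 'Hard Drinks', 'Starchy Foods', 'Breakfast', 'Seafood']
    X_copy['Product_Type'] = X_copy['Product_Type'].replace(rare_categories, 'Others')
    return X_copy

toy = pd.DataFrame({'Product_Type': ['Dairy', 'Seafood', 'Breads']})
out = FunctionTransformer(consolidate_categories).fit_transform(toy)
print(out['Product_Type'].tolist())  # ['Dairy', 'Others', 'Others']
```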

🧠 REMEMBER¶

Pipeline Facade Structures (general idea)

Pipeline(steps=[
    ('preprocessor', ColumnTransformer([
        # Encoding / scaling per column group
    ])),
    ('model', LinearRegression or Ridge or ... )
])
Pipeline(steps=[
    ('feature_engineering', CustomTransformer()),  # optional
    ('preprocessing', ColumnTransformer([...])),   # encoding + scaling
    ('feature_selection', SelectKBest(...)),       # optional
    ('model', Ridge())                             # or CatBoost, etc.
])
Pipeline([
    ('preprocessing', ColumnTransformer([
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ])),
    ('model', LinearRegression())
])

For Tree based models

Pipeline([
    ('model', CatBoostRegressor(...))
])

NOTE

ColumnTransformer is for:

  • Parallel column-specific transformations
  • Different operations on different columns simultaneously
  • Each column can only appear in ONE transformer specification

Train Test Split¶

Note: Since target is continuous, using simple random split instead of stratification.

In [269]:
def create_train_test_split(df, target_col='Product_Store_Sales_Total', test_size=0.2, random_state=42):
    """
    Create train-test split with proper separation

    Args:
        df: Full dataset
        target_col: Target variable name
        test_size: Proportion for test set
        random_state: For reproducibility

    Returns:
        X_train, X_test, y_train, y_test
    """

    # Separate features and target
    X = df.drop(columns=[target_col])
    y = df[target_col]

    print("📊 TRAIN-TEST SPLIT SUMMARY")
    print(f"Total samples: {len(df):,}")
    print(f"Features: {X.shape[1]}")
    print(f"Target: {target_col}")
    print(f"Split ratio: {int((1-test_size)*100)}/{int(test_size*100)}")

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state,
        shuffle=True
    )

    print("\n📈 SPLIT RESULTS")
    print(f"Training samples: {len(X_train):,} ({len(X_train)/len(df)*100:.1f}%)")
    print(f"Test samples: {len(X_test):,} ({len(X_test)/len(df)*100:.1f}%)")

    # Target distribution check
    print("\n🎯 TARGET DISTRIBUTION")
    print(f"Train - Mean: {y_train.mean():.0f}, Std: {y_train.std():.0f}")
    print(f"Test  - Mean: {y_test.mean():.0f}, Std: {y_test.std():.0f}")

    return X_train, X_test, y_train, y_test
In [270]:
# Create split
X_train, X_test, y_train, y_test = create_train_test_split(df)
📊 TRAIN-TEST SPLIT SUMMARY
Total samples: 8,763
Features: 11
Target: Product_Store_Sales_Total
Split ratio: 80/20

📈 SPLIT RESULTS
Training samples: 7,010 (80.0%)
Test samples: 1,753 (20.0%)

🎯 TARGET DISTRIBUTION
Train - Mean: 3465, Std: 1065
Test  - Mean: 3461, Std: 1068

🧠 REMEMBER¶

Residual plots in regression are as helpful as accuracy/loss plots in classification, but they serve slightly different diagnostic purposes:

Residual plots: Diagnose model assumptions (linearity, homoscedasticity, normality) and reveal systematic error patterns

Accuracy/Loss plots: Diagnose training dynamics (overfitting, convergence, learning progression)

Both are essential for model validation - residual plots focus more on model assumptions and error structure, while accuracy/loss plots focus more on training behavior and generalization.

Multiple metrics & their perspectives:

  • RMSE: Penalizes large errors more (sensitive to outliers).
  • MAE: Gives average absolute error (robust baseline).
  • R²: Explains variance captured by the model (fit quality).
  • Adj R²: Adjusts R² for number of features (prevents overfitting).
  • MAPE: Expresses error in percentage terms (interpretable for business).

They complement each other - no single metric tells the full story.
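To see how the metrics diverge, here is a hand-rolled sketch on toy numbers (formulas only, no sklearn) where a single large error inflates RMSE far more than MAE:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 280.0, 520.0])  # one large miss (400 -> 520)

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))                                   # ~61.2
mae = np.mean(np.abs(errors))                                          # 40.0
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)   # 0.70
mape = np.mean(np.abs(errors / y_true)) * 100                          # ~12.9%

print(f"RMSE={rmse:.1f} MAE={mae:.1f} R2={r2:.2f} MAPE={mape:.1f}%")
```

RMSE is about 1.5x MAE here purely because of the one outlier, while R² on the same data still looks decent - exactly why no single metric tells the full story.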

QQ Plot of Residuals

  • Normality of residuals: The plot visually checks if residuals follow a normal distribution — points should lie roughly along the diagonal.
  • It supports model diagnostics, especially for:
    • Validity of inference (t-tests, F-tests).
    • Identifying skewness or kurtosis issues.
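The same check can be done numerically: scipy's probplot (used for the Q-Q panels later in this notebook) also returns a least-squares fit, and an r close to 1 means the points hug the diagonal. A sketch on synthetic, well-behaved residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=250.0, size=500)  # synthetic normal residuals

# (osm, osr) = theoretical vs ordered sample quantiles; r is the fit correlation
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(f"Q-Q fit r = {r:.3f}")  # near 1.0 for normal residuals
```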

Evaluation/Gauge of Model

  • “Train” ≠ whole train dataset.
  • It's a shorthand for “cross-validation score on training folds”.
  • No need to separately validate again if GridSearch is used properly.
  • So, when we say "check on train", we are indirectly checking cross-validated training subsets, not the untouched training dataset.

⚡ Train | Test | Validation

GridSearchCV does internal cross-validation on training data for hyperparameter selection, but this is different from your holdout test set.

Three distinct purposes:

  • Train metrics: Fitting performance
  • CV validation (inside GridSearchCV): Hyperparameter tuning
  • Test metrics: Final unseen data evaluation

The CV validation in GridSearchCV is for model selection, not final model assessment. You still need that clean test set for unbiased final evaluation.

Moreover, during tuning the model never sees 100% of the training data at once (each CV fit uses k-1 folds); only the final refit trains on the full training set.

Train metrics = model's performance on data it has now fully seen, while CV metrics = average validation performance across folds during tuning.

Train metrics purpose: Diagnose overfitting/underfitting

  • Train >> Test: Overfitting (memorizing training data)
  • Train ≈ Test: Good generalization
  • Both poor: Underfitting (model too simple)

Train metrics alone are meaningless - their value comes from comparison with test/validation metrics to understand model behavior.
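That comparison can be encoded as a tiny heuristic (the thresholds here are illustrative assumptions, not fixed rules):

```python
def diagnose_fit(train_r2, test_r2, gap_tol=0.05, floor=0.5):
    """Classify model behavior from train/test R2; thresholds are illustrative."""
    if train_r2 < floor and test_r2 < floor:
        return "underfitting"          # both poor: model too simple
    if train_r2 - test_r2 > gap_tol:
        return "overfitting"           # train >> test: memorizing
    return "good generalization"       # train ~ test

print(diagnose_fit(0.95, 0.70))  # overfitting
print(diagnose_fit(0.82, 0.83))  # good generalization
print(diagnose_fit(0.40, 0.38))  # underfitting
```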

🐍 Code Utils (Performance Helpers)¶

In [271]:
class ModelPerformanceAnalyzer:
    """
    Comprehensive model performance analysis for regression tasks
    """

    @staticmethod
    def adj_r2_score(predictors, targets, predictions):
        """Calculate adjusted R-squared"""
        r2 = r2_score(targets, predictions)
        n = predictors.shape[0]
        k = predictors.shape[1]
        return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

    @staticmethod
    def calculate_metrics(model, X, y, dataset_name="Dataset"):
        """
        Calculate comprehensive regression metrics

        Args:
            model: Trained model
            X: Features
            y: Actual targets
            dataset_name: Name for identification

        Returns:
            Dictionary of metrics
        """
        predictions = model.predict(X)

        metrics = {
            'Dataset': dataset_name,
            'RMSE': mean_squared_error(y, predictions, squared=False),
            'MAE': mean_absolute_error(y, predictions),
            'R2': r2_score(y, predictions),
            'Adj_R2': ModelPerformanceAnalyzer.adj_r2_score(X, y, predictions),
            'MAPE': mean_absolute_percentage_error(y, predictions) * 100,  # Convert to percentage
            'Samples': len(y)
        }

        return metrics

    @staticmethod
    def analyze_residuals(model, X, y, title="Residual Analysis"):
        """
        Create comprehensive residual analysis plots
        """
        predictions = model.predict(X)
        residuals = y - predictions

        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        fig.suptitle(f'{title}', fontsize=16, fontweight='bold')

        # 1. Predicted vs Actual
        axes[0, 0].scatter(y, predictions, alpha=0.6, color='blue')
        axes[0, 0].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
        axes[0, 0].set_xlabel('Actual Values')
        axes[0, 0].set_ylabel('Predicted Values')
        axes[0, 0].set_title('Predicted vs Actual')

        # Add R² on plot
        r2 = r2_score(y, predictions)
        axes[0, 0].text(0.05, 0.95, f'R² = {r2:.3f}', transform=axes[0, 0].transAxes,
                       bbox=dict(boxstyle="round", facecolor='white'), fontsize=12)

        # 2. Residuals vs Predicted
        axes[0, 1].scatter(predictions, residuals, alpha=0.6, color='green')
        axes[0, 1].axhline(y=0, color='r', linestyle='--')
        axes[0, 1].set_xlabel('Predicted Values')
        axes[0, 1].set_ylabel('Residuals')
        axes[0, 1].set_title('Residuals vs Predicted')

        # 3. Residuals Distribution
        axes[1, 0].hist(residuals, bins=30, alpha=0.7, color='orange', edgecolor='black')
        axes[1, 0].set_xlabel('Residuals')
        axes[1, 0].set_ylabel('Frequency')
        axes[1, 0].set_title('Residual Distribution')

        # Add normality test
        _, p_value = stats.normaltest(residuals)
        axes[1, 0].text(0.05, 0.95, f'Normality p-value: {p_value:.3f}',
                       transform=axes[1, 0].transAxes,
                       bbox=dict(boxstyle="round", facecolor='white'), fontsize=10)

        # 4. Q-Q Plot
        stats.probplot(residuals, dist="norm", plot=axes[1, 1])
        axes[1, 1].set_title('Q-Q Plot (Normality Check)')

        plt.tight_layout()
        plt.show()

        return predictions, residuals
In [272]:
def evaluate_model_comprehensive(model, X_train, y_train, X_test, y_test, model_name, model_type, tracker):
    """
    Comprehensive model evaluation workflow
    """
    analyzer = ModelPerformanceAnalyzer()

    # Calculate metrics
    train_metrics = analyzer.calculate_metrics(model, X_train, y_train, "Train")
    test_metrics = analyzer.calculate_metrics(model, X_test, y_test, "Test")

    # Display metrics
    print(f"\n📊 {model_name} PERFORMANCE")
    print("=" * 50)
    print("Training Metrics:")
    for key, value in train_metrics.items():
        if key != 'Dataset':
            print(f"  {key}: {value:.4f}")

    print("\nTesting Metrics:")
    for key, value in test_metrics.items():
        if key != 'Dataset':
            print(f"  {key}: {value:.4f}")

    # Residual analysis
    analyzer.analyze_residuals(model, X_test, y_test, f"{model_name} - Test Set")

    # Add to tracker
    tracker.add_result(model_name, model_type, train_metrics, test_metrics, model)

    return train_metrics, test_metrics
In [273]:
# Metric Selection Rationale
def get_metric_rationale():
    """
    Provide rationale for metric selection
    """
    rationale = """
    📊 METRIC SELECTION RATIONALE FOR SUPERKART SALES FORECASTING:

    🎯 PRIMARY METRIC: RMSE (Root Mean Square Error)
    • Penalizes large prediction errors more heavily
    • Important for business: Large forecast errors = inventory problems
    • Scale-interpretable: Same units as target (₹)

    📈 SECONDARY METRICS:

    MAE (Mean Absolute Error):
    • Business-friendly interpretation: Average error in rupees
    • Less sensitive to outliers than RMSE
    • Easy stakeholder communication

    R² (R-squared):
    • Percentage of variance explained by model
    • Model explanatory power indicator
    • Benchmarking against simple baseline

    Adjusted R²:
    • Accounts for number of features
    • Prevents overfitting through feature count penalty
    • Better for model selection with different feature sets

    MAPE (Mean Absolute Percentage Error):
    • Percentage-based error measurement
    • Scale-independent comparison
    • Business KPI compatibility

    🎯 BUSINESS IMPACT:
    Lower RMSE = Better inventory planning = Reduced stockouts & overstock costs
    """
    print(rationale)
In [274]:
get_metric_rationale()
    📊 METRIC SELECTION RATIONALE FOR SUPERKART SALES FORECASTING:

    🎯 PRIMARY METRIC: RMSE (Root Mean Square Error)
    • Penalizes large prediction errors more heavily
    • Important for business: Large forecast errors = inventory problems
    • Scale-interpretable: Same units as target (₹)

    📈 SECONDARY METRICS:

    MAE (Mean Absolute Error):
    • Business-friendly interpretation: Average error in rupees
    • Less sensitive to outliers than RMSE
    • Easy stakeholder communication

    R² (R-squared):
    • Percentage of variance explained by model
    • Model explanatory power indicator
    • Benchmarking against simple baseline

    Adjusted R²:
    • Accounts for number of features
    • Prevents overfitting through feature count penalty
    • Better for model selection with different feature sets

    MAPE (Mean Absolute Percentage Error):
    • Percentage-based error measurement
    • Scale-independent comparison
    • Business KPI compatibility

    🎯 BUSINESS IMPACT:
    Lower RMSE = Better inventory planning = Reduced stockouts & overstock costs
    

Modeling and Pipelines | (Preprocessing)¶

Model Selection Notion: Linear Regression + CatBoost

Linear Regression:

  • Strong linear patterns detected in EDA (Product_MRP correlation: 0.788)
  • Maximum interpretability for business stakeholders
  • Fast, stable, production-friendly
  • Direct coefficient insights for actionable decisions

CatBoost:

  • Handles categorical features natively (minimal preprocessing)
  • Top performance on tabular data
  • Built-in overfitting protection
  • Captures non-linear patterns Linear Regression might miss

Why This Combination:

  • Algorithmic diversity: Linear vs Gradient Boosting approaches
  • Preprocessing contrast: Full pipeline vs minimal processing
  • Business balance: Interpretability (LR) + Performance (CatBoost)
  • Validation approach: Tests if relationships are truly linear or require complex interactions

Result: Covers both ends of the ML spectrum - simple/interpretable vs complex/high-performance - ensuring robust model selection regardless of data complexity.

  • Linear Regression - for interpretability and baseline.
  • CatBoost - for performance.

We could also consider RandomForest (left as a possible extension or homework later, i.e., an enhancement).

NOTE:

💡 Linear models need scaling, encoding, transformations, and outlier handling, while RF/CatBoost are more robust to raw data and handle categoricals natively.

Code Utils¶

Common helpers for Modeling

In [275]:
analyzer = ModelPerformanceAnalyzer()
tracker = ModelResultTracker()
In [276]:
def get_feature_names_from_pipeline(model):
    '''
    **Model Feature Tracing depends on the algorithm:**

    - CatBoost: `feature_names_`
    - XGBoost: `feature_names`
    - LightGBM: `feature_name_`
    - sklearn Pipelines: through `get_feature_names_out()`

    **No direct feature names:**
    - Basic sklearn models (LinearRegression, RandomForest, etc.)
    - Need to track features externally or through pipeline
    - Only have coefficients/importance without column names

    So feature tracing varies by library/implementation - not universal across all models!
    '''
    try:
        # If it's a pipeline
        if hasattr(model, 'named_steps'):
            preprocessor = model.named_steps["preprocessing"]
            return preprocessor.get_feature_names_out()
        # If it's a CatBoost model
        elif hasattr(model, 'feature_names_'):
            return model.feature_names_
        # If it's another type of model
        else:
            return None
    except Exception as e:
        print(f"Could not get feature names: {e}")
        return None
In [277]:
# TODO: rename to pipeline

def quick_model_eval(pipeline, X_train, y_train, X_test, y_test, model_name="Model"):
    """
    Quick sanity check for fitted pipeline

    Args:
        pipeline: Fitted sklearn pipeline
        X_train: Train features
        y_train: Train targets
        X_test: Test features
        y_test: Test targets
        model_name: Name for logging
    """
    print(f"\n🚀 QUICK TEST: {model_name}")
    print("-" * 40)

    # Feature names
    feature_names = get_feature_names_from_pipeline(pipeline)
    if feature_names is not None:
        print(f"Feature names: {feature_names}")

    # # Predict
    # y_pred_test = pipeline.predict(X_test)
    # y_pred_train = pipeline.predict(X_train)


    # Calculate metrics using your existing method
    metrics_test = ModelPerformanceAnalyzer.calculate_metrics(pipeline, X_test, y_test)
    metrics_train = ModelPerformanceAnalyzer.calculate_metrics(pipeline, X_train, y_train)

    # Quick summary
    print("\nTrain Metrics:--- ")
    print(f"RMSE: {metrics_train['RMSE']:.2f}")
    print(f"R2: {metrics_train['R2']:.3f}")
    print(f"MAE: {metrics_train['MAE']:.2f}")
    print("\nTest Metrics:---")
    print(f"RMSE: {metrics_test['RMSE']:.2f}")
    print(f"R2: {metrics_test['R2']:.3f}")
    print(f"MAE: {metrics_test['MAE']:.2f}")

# Usage after fitting:
# linear_results = quick_model_test(linear_model_pipeline, X_test, y_test, "Linear Regression")
# catboost_results = quick_model_test(catboost_pipeline, X_test, y_test, "CatBoost")
In [278]:
def fit_model(model):
    '''
    Fit the model with debug logging enabled during training.
    NOTE: uses the global X_train and y_train defined above.
    '''
    print('\n⚡ Model Training Started ---------\n')
    DebugTransformer.ENABLE_DEBUG = True
    model.fit(X_train, y_train)
    DebugTransformer.ENABLE_DEBUG = False
    print('\n⚡ Model Training Completed ---------')

GuideLines (Expectations)¶

In [279]:
df['Product_Store_Sales_Total'].describe()
Out[279]:
count    8763.000000
mean     3464.003640
std      1065.630494
min        33.000000
25%      2761.715000
50%      3452.340000
75%      4145.165000
max      8000.000000
Name: Product_Store_Sales_Total, dtype: float64

✅ Good Model Performance Guide

With current target stats, a "decent" sales prediction model should ideally have

Metric   Ballpark Range
RMSE     < 800 (decent), < 600 (good), < 450 (very good)
MAE      < 500 preferred
R²       > 0.7 is promising

Baseline Comparison

  • Naive baseline (always predict mean): MAE = 866, RMSE = 1,066
  • The model should significantly beat this baseline!

Pro tip: Given the data's moderate variability (31% CV), achieving 12-15% MAPE would be quite good for a sales prediction model!
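The naive baseline quoted above can be reproduced with a small helper (a sketch; `naive_baseline` is a hypothetical name, not part of the notebook's utilities):

```python
import numpy as np

def naive_baseline(y_train, y_test):
    """Always predict the training mean; any real model must beat these numbers."""
    pred = np.full(len(y_test), np.mean(y_train))
    mae = np.mean(np.abs(y_test - pred))
    rmse = np.sqrt(np.mean((y_test - pred) ** 2))
    return mae, rmse

# Toy check: predicting the mean makes RMSE equal the (population) std of y
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mae, rmse = naive_baseline(y, y)
print(f"MAE={mae:.2f} RMSE={rmse:.3f}")  # MAE=1.20 RMSE=1.414
```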

M1: Linear Model (Linear Regression)¶

In [280]:
m1_name = "Linear Model"

Pipeline order:

CustomDataPrepper → consolidate_categories → ColumnTransformer

NOTE:

📌 Use validate=False for FunctionTransformer when your function needs pandas DataFrame functionality. Most custom preprocessing functions need this setting.

In [281]:
# Step 1: Define the preprocessing pipeline
linear_preprocessing_pipeline = Pipeline([

    # Step 1: Universal data cleaning & preparation
    ('data_prep', CustomDataPrepper()),
    ('debug1', DebugTransformer("After Data Cleaning")),

    # Step 2: Category consolidation (Product_Type rare grouping)
    ('consolidate_rare', FunctionTransformer(
        consolidate_categories,

        # ?? Working with pandas dataframe, not numpy array, then use validate=False
        validate=False,
        feature_names_out='one-to-one' # Fixes feature_names_out error
    )),
    ('debug2', DebugTransformer("After Category Consolidation")),

    # Step 3: Column-specific transformations
    ('column_transform', ColumnTransformer([

        # One-Hot Encoding (Nominal Categorical)
        ('onehot', OneHotEncoder(
            drop='first',           # Avoid multicollinearity
            handle_unknown='ignore' # Handle unseen categories
        ), ['Product_Sugar_Content', 'Product_Type', 'Store_Type']),

        # Power Transformation (Right-skewed numerical)
        ('power_transform', PowerTransformer(
            method='yeo-johnson',
            standardize=True        # Built-in scaling
        ), ['Product_Allocated_Area']),

        # Ordinal Encoding (Ordinal Categorical)
        ('ordinal', OrdinalEncoder(
            categories=[
                ['Small', 'Medium', 'High'],                  # Store_Size
                ['Tier 3', 'Tier 2', 'Tier 1']                # Store_Location_City_Type
            ]
        ), ['Store_Size', 'Store_Location_City_Type']),

        # Standard Scaling (Normal numerical)
        ('standard_scale', StandardScaler(),
         ['Product_Weight', 'Product_MRP', 'Store_Age']),

        # # MinMax Scaling (Discrete numerical - Store_Age)
        # ('minmax_scale', MinMaxScaler(),
        #  ['Store_Age'])

    ], remainder='drop')), # Drop any unspecified columns
    ('debug3', DebugTransformer("After Preprocessing (ie Column Transformation)")),
])
In [282]:
# Step 4: Complete Linear Regression Pipeline
linear_model_pipeline = Pipeline([
    ('preprocessing', linear_preprocessing_pipeline),
    ('model', LinearRegression())
], verbose=True)
In [283]:
fit_model(linear_model_pipeline)
⚡ Model Training Started ---------


🪵 After Data Cleaning ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

🪵 After Category Consolidation ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

🪵 After Preprocessing (ie Column Transformation) ---- (D)
Data type: numpy array
[Pipeline] ..... (step 1 of 2) Processing preprocessing, total=   0.0s
[Pipeline] ............. (step 2 of 2) Processing model, total=   0.0s

⚡ Model Training Completed ---------
In [284]:
quick_model_eval(linear_model_pipeline, X_train, y_train, X_test, y_test, model_name="Linear Regression")
🚀 QUICK TEST: Linear Regression
----------------------------------------
Feature names: ['onehot__Product_Sugar_Content_No Sugar'
 'onehot__Product_Sugar_Content_Regular' 'onehot__Product_Type_Canned'
 'onehot__Product_Type_Dairy' 'onehot__Product_Type_Frozen Foods'
 'onehot__Product_Type_Fruits and Vegetables'
 'onehot__Product_Type_Health and Hygiene'
 'onehot__Product_Type_Household' 'onehot__Product_Type_Meat'
 'onehot__Product_Type_Others' 'onehot__Product_Type_Snack Foods'
 'onehot__Product_Type_Soft Drinks' 'onehot__Store_Type_Food Mart'
 'onehot__Store_Type_Supermarket Type1'
 'onehot__Store_Type_Supermarket Type2'
 'power_transform__Product_Allocated_Area' 'ordinal__Store_Size'
 'ordinal__Store_Location_City_Type' 'standard_scale__Product_Weight'
 'standard_scale__Product_MRP' 'standard_scale__Store_Age']

Train Metrics:--- 
RMSE: 449.16
R2: 0.822
MAE: 261.04

Test Metrics:---
RMSE: 435.23
R2: 0.834
MAE: 257.00

Key Highlights 🌟

  1. Strong Predictive Power: R² of 0.834 means your model explains 83.4% of sales variance
  2. Business-Friendly Accuracy: MAE of 257 means average prediction error is only 7.4% of actual sales
  3. No Overfitting: Test performance (R² 0.834) > Train performance (R² 0.822) - excellent sign!
  4. Consistent Performance: RMSE and MAE are well-aligned, indicating good model stability

Business Impact 💼

  • Your predictions are typically within ±257 units of actual sales
  • 83.4% of sales variation is captured by your model
  • This level of accuracy is very suitable for business planning and inventory management

Overall, the model generalizes well.

Metric   Value    Interpretation
RMSE     435.23   ✅ Very strong - only ~12.6% of target mean, ~40% of std
MAE      257.00   ✅ Excellent - very low average absolute error
R²       0.834    ✅ High - ~83% of variance explained

Let's track this model

In [285]:
evaluate_model_comprehensive(
    model=linear_model_pipeline,
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    model_name="Linear Regression",
    model_type="Regression",
    tracker=tracker
)
📊 Linear Regression PERFORMANCE
==================================================
Training Metrics:
  RMSE: 449.1570
  MAE: 261.0367
  R2: 0.8221
  Adj_R2: 0.8218
  MAPE: 9.9978
  Samples: 7010.0000

Testing Metrics:
  RMSE: 435.2252
  MAE: 257.0005
  R2: 0.8340
  Adj_R2: 0.8329
  MAPE: 9.5983
  Samples: 1753.0000
Out[285]:
({'Dataset': 'Train',
  'RMSE': 449.15696331939733,
  'MAE': 261.036727532097,
  'R2': 0.8221034286834505,
  'Adj_R2': 0.8218237970337675,
  'MAPE': 9.997835356878639,
  'Samples': 7010},
 {'Dataset': 'Test',
  'RMSE': 435.2251928580389,
  'MAE': 257.0005305191101,
  'R2': 0.8339897902644267,
  'Adj_R2': 0.8329409032413989,
  'MAPE': 9.598348304428047,
  'Samples': 1753})
DEBUGGING¶

NOTE: The two code cells below are only for debugging, used to trace an issue that came up while using evaluate_model_comprehensive. They simply check that each step of the model/pipeline behaves as expected.

Pipeline Check

In [286]:
# Check what happens at each pipeline step
print("Original X shape:", X_train.shape)

# Test each step of your pipeline
if hasattr(linear_model_pipeline, 'named_steps'):
    for step_name, step in linear_model_pipeline.named_steps.items():
        print(f"\nStep: {step_name}")
        if hasattr(step, 'transform'):
            try:
                transformed = step.transform(X_train)
                print(f"After {step_name}: {transformed.shape}")
            except Exception as e:
                print(f"Cannot transform at {step_name}: {e}")
Original X shape: (7010, 11)

Step: preprocessing
After preprocessing: (7010, 21)

Step: model
In [287]:
# Test the preprocessing steps individually
preprocessed_X = linear_model_pipeline[:-1].transform(X_train)  # All steps except final model
print("Preprocessed X shape:", preprocessed_X.shape)

# Test final model prediction on preprocessed data
final_model = linear_model_pipeline[-1]
preds = final_model.predict(preprocessed_X)
print("Final predictions shape:", preds.shape)
Preprocessed X shape: (7010, 21)
Final predictions shape: (7010,)

M2: CatBoost (Tree based Model)¶

Gist 🤖¶

For CatBoost, we need much simpler preprocessing.

CatBoost Preprocessing Requirements

✅ NEEDED:

  1. CustomDataPrepper only:
    • Fix 'reg' → 'Regular' in Product_Sugar_Content
    • Transform Store_Establishment_Year → Store_Age (with clamping)
    • Still exclude Store_Id, Product_Id (just identifier)

❌ NOT NEEDED:

  1. No category consolidation - CatBoost handles high cardinality well (16 Product_Type categories is fine)
  2. No scaling - CatBoost is scale-invariant
  3. No encoding - CatBoost handles categoricals natively
  4. No transformations - CatBoost can handle skewed distributions

CatBoost Pipeline Structure

Pipeline([
    ('data_prep', CustomDataPrepper()),  # Only universal cleaning
    ('model', CatBoostRegressor())       # That's it!
])

CatBoost Configuration We'll need to specify which columns are categorical:

  • Product_Sugar_Content
  • Product_Type
  • Store_Size
  • Store_Location_City_Type
  • Store_Type

Key advantage: CatBoost will automatically handle the categorical encoding internally and can capture non-linear patterns that Linear Regression missed.

This should be much faster to implement than the Linear Regression pipeline.

In [288]:
m2_name = "CatBoost Model"
In [289]:
# This is an alternative approach to discard the Product_Id and Store_Id columns

# NOTE: here the output will be Dataframe

# ?? The class below is commented out as it's not being used for now; if needed, place it in the `preprocessing` folder and import it here

# class ColumnDropper(BaseEstimator, TransformerMixin):
#     def __init__(self, columns_to_drop):
#         self.columns_to_drop = columns_to_drop

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         existing_cols = [col for col in self.columns_to_drop if col in X.columns]
#         return X.drop(columns=existing_cols)

#     def get_feature_names_out(self, input_features=None):
#         # Return input_features minus the dropped ones
#         return [feat for feat in input_features if feat not in self.columns_to_drop]

# Now your pipeline becomes:
catboost_preprocessing_pipeline = Pipeline([

    # Step 1: Universal data cleaning & preparation
    ('data_prep', CustomDataPrepper()),
    ('debug1', DebugTransformer("After Data Cleaning", show_head=True)),

    # # Step 2: Drop irrelevant ID columns
    # ('drop_ids', ColumnDropper(columns_to_drop=['Product_ID', 'Store_ID'])),
    # ('debug2', DebugTransformer("After Dropping ID Columns", show_head=True)),
])
In [290]:
# CatBoost preprocessing pipeline (Alternative)

# NOTE: Here the output will be numpy

# ?? Currently not using this, as it's an alternative to the one above

# catboost_preprocessing_pipeline_v2 = Pipeline([
#     # Step 1: Universal data cleaning & preparation
#     ('data_prep', CustomDataPrepper()),
#     ('debug1', DebugTransformer("After Data Cleaning")),

#     # Step 2: No transformations needed, just column organization
#     ('column_setup', ColumnTransformer([
#         # Pass-through for all features (no transformations)
#         ('passthrough', 'passthrough', [
#             # Numeric Features
#             'Product_Weight',
#             'Product_Allocated_Area',
#             'Product_MRP',
#             'Store_Age',
#             # Categorical Features
#             'Product_Sugar_Content',
#             'Product_Type',
#             'Store_Size',
#             'Store_Location_City_Type',
#             'Store_Type'
#         ])
#     ], remainder='drop')),  # Drop any unspecified columns
#     ('debug2', DebugTransformer("After Column Setup"))
# ])
In [291]:
# Define categorical features for CatBoost
cat_features = [
    'Product_Sugar_Content',
    'Product_Type',
    'Store_Size',
    'Store_Location_City_Type',
    'Store_Type',
]

# Complete CatBoost Pipeline
catboost_model_pipeline = Pipeline([
    ('preprocessing', catboost_preprocessing_pipeline),
    ('model', CatBoostRegressor(
        cat_features=cat_features,
        verbose=False  # Suppress training output
    ))
])
catboost_model_pipeline
Out[291]:
Pipeline(steps=[('preprocessing',
                 Pipeline(steps=[('data_prep', CustomDataPrepper()),
                                 ('debug1',
                                  DebugTransformer(message='After Data '
                                                           'Cleaning',
                                                   show_head=True))])),
                ('model',
                 <catboost.core.CatBoostRegressor object at 0x16b309350>)])

📌 NOTE:

The pipeline for CatBoost will work because:

  1. The preprocessing steps keep the pipeline output as a DataFrame

    1. CustomDataPrepper (and ColumnDropper, if enabled) operate on DataFrames and return DataFrames, so CatBoost will receive a DataFrame.
    2. As a result, the column names will remain intact.
  2. CatBoostRegressor can automatically identify categorical columns by name

    1. If cat_features is passed as a list of column names, and the input to CatBoost is a DataFrame, it works perfectly.
In [292]:
fit_model(catboost_model_pipeline)
⚡ Model Training Started ---------


🪵 After Data Cleaning ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']
First 2 rows:
      Product_Weight Product_Sugar_Content  Product_Allocated_Area  \
5967           13.64              No Sugar                   0.071   
8270           10.03             Low Sugar                   0.016   

      Product_Type  Product_MRP Store_Size Store_Location_City_Type  \
5967     Household       157.10     Medium                   Tier 2   
8270  Frozen Foods       126.66       High                   Tier 2   

             Store_Type  Store_Age  
5967  Supermarket Type2         16  
8270  Supermarket Type1         38  

⚡ Model Training Completed ---------
In [293]:
quick_model_eval(catboost_model_pipeline, X_train, y_train, X_test, y_test, model_name=m2_name)
🚀 QUICK TEST: CatBoost Model
----------------------------------------
Feature names: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

Train Metrics:--- 
RMSE: 244.31
R2: 0.947
MAE: 89.62

Test Metrics:---
RMSE: 277.21
R2: 0.933
MAE: 103.77

Key Highlights 🌟

  1. Exceptional Predictive Power: R² of 0.933 means your model explains 93.3% of sales variance
  2. High-Precision Accuracy: MAE of 103.77 means average prediction error is only 3% of actual sales
  3. Superior Performance: RMSE and MAE show remarkable improvement over linear regression

Business Impact 💼

  • Predictions are typically within ±104 units of actual sales
  • 93.3% of sales variation is captured by your model
  • This level of accuracy is excellent for business planning and inventory management
  • Generalization is excellent - gap between train/test is small.
  • Significant improvement over linear regression (±257 units → ±104 units)

This model is highly accurate, tight, and stable - likely near optimal.

  • CatBoost crushed it, especially in reducing MAE.

Metric   Value    Interpretation
RMSE     277.21   ✅ Outstanding — just ~8% of target mean, ~26% of std
MAE      103.77   ✅ Extremely low error — ~3% of mean, very tight predictions
R²       0.933    ✅ Excellent — explains 93.3% of target variance

Metric      Old Model   CatBoost   Improvement
Test RMSE   435.23      277.21     ⬇️ ~36%
Test MAE    257.00      103.77     ⬇️ ~60%
Test R²     0.834       0.933      ⬆️ ~10% absolute gain
In [294]:
evaluate_model_comprehensive(
    model=catboost_model_pipeline,
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    model_name=m2_name,
    model_type="Tree",
    tracker=tracker
)
📊 CatBoost Model PERFORMANCE
==================================================
Training Metrics:
  RMSE: 244.3106
  MAE: 89.6245
  R2: 0.9474
  Adj_R2: 0.9473
  MAPE: 3.4579
  Samples: 7010.0000

Testing Metrics:
  RMSE: 277.2118
  MAE: 103.7653
  R2: 0.9327
  Adj_R2: 0.9322
  MAPE: 3.7795
  Samples: 1753.0000
Out[294]:
({'Dataset': 'Train',
  'RMSE': 244.31064176551922,
  'MAE': 89.62448631091097,
  'R2': 0.9473672486757516,
  'Adj_R2': 0.9472845164287429,
  'MAPE': 3.457860281440354,
  'Samples': 7010},
 {'Dataset': 'Test',
  'RMSE': 277.2117996479401,
  'MAE': 103.76530600486473,
  'R2': 0.9326511522361677,
  'Adj_R2': 0.932225628212387,
  'MAPE': 3.7794830900862766,
  'Samples': 1753})

Model Performance Improvement

🚀 REMEMBER: Randomized vs GridSearch 🧠

Let's compare for our use case:

RandomizedSearchCV Advantages:

  1. ✅ Faster - important as CatBoost is already computationally intensive
  2. ✅ Better for continuous params (learning_rate, l2_leaf_reg)
  3. ✅ Can explore more combinations in same time
  4. ✅ Good for large parameter spaces (our case)

GridSearchCV Disadvantages:

  1. ❌ Computationally expensive - tests every combination
  2. ❌ Curse of dimensionality - with 5+ params, combinations explode
  3. ❌ May waste time on unimportant parameter combinations
  4. ❌ Fixed intervals might miss optimal values

Verdict: RandomizedSearchCV is better for our case:

  • Multiple continuous parameters
  • Limited computational budget
  • Need broad parameter space exploration
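The "combinations explode" point is easy to make concrete. A back-of-the-envelope count (the discretized values below are hypothetical, just to illustrate the scale) compares a full grid over the same five CatBoost parameters with `n_iter=20` random samples:

```python
from itertools import product

# Hypothetical discretization of the same 5 CatBoost params:
grid = {
    "iterations": [100, 300, 500, 700, 900],
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
    "depth": [4, 6, 8, 10],
    "l2_leaf_reg": [1, 3, 5, 10],
    "min_data_in_leaf": [1, 10, 20, 30],
}

# GridSearchCV fits every combination, per CV fold
n_grid = len(list(product(*grid.values())))
print(n_grid)        # 5 * 5 * 4 * 4 * 4 = 1600 candidates
print(n_grid * 5)    # with cv=5: 8000 fits for GridSearchCV
print(20 * 5)        # vs 100 fits for RandomizedSearchCV(n_iter=20, cv=5)
```

Even with a modest 4-5 values per parameter, the grid requires 80x more fits than the randomized search used below.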

M3: CatBoost Tuned (Tuned)¶

In [295]:
m3_name = "CatBoost_Tuned"

For CatBoost regression, key hyperparameters to tune are:

  1. Learning Parameters:

    • learning_rate
    • iterations (n_estimators)
    • depth (max_depth)
  2. Regularization:

    • l2_leaf_reg
    • min_data_in_leaf
  3. Optimization:

    • early_stopping_rounds
    • eval_metric='RMSE' (our metric of choice)

Devised approach:

  1. Use RandomizedSearchCV (faster than GridSearchCV)
  2. Define parameter grid
  3. Use cross-validation
  4. Track best parameters and improvement
In [296]:
# Parameter distribution for random search
param_distributions = {
    'model__iterations': randint(100, 1000),
    'model__learning_rate': uniform(0.01, 0.3),
    'model__depth': randint(4, 10),
    'model__l2_leaf_reg': uniform(1, 10),
    'model__min_data_in_leaf': randint(1, 30)
}

NOTE: ⚠️ Use prefix model__ because CatBoostRegressor is wrapped in the Pipeline under the 'model' step.
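The step-name prefix convention can be verified with `get_params()`; a minimal sketch using a stand-in one-step pipeline (Ridge here is just for illustration):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pipe = Pipeline([("model", Ridge())])

# Every tunable parameter is exposed as <step_name>__<param_name>
params = pipe.get_params()
assert "model__alpha" in params
print(sorted(p for p in params if p.startswith("model__"))[:3])
```

Listing `get_params()` like this is a quick way to find the exact keys a search grid must use.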

In [297]:
# RandomizedSearchCV setup
catboost_rs = RandomizedSearchCV(
    estimator=catboost_model_pipeline,
    param_distributions=param_distributions,
    n_iter=20,  # number of parameter settings sampled
    cv=5,       # 5-fold cross-validation
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,  # use all CPU cores
    verbose=1,
    random_state=42
)
In [298]:
# Fit CatBoost - RandomizedSearchCV
fit_model(catboost_rs)
⚡ Model Training Started ---------

Fitting 5 folds for each of 20 candidates, totalling 100 fits

🪵 After Data Cleaning ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']
First 2 rows:
      Product_Weight Product_Sugar_Content  Product_Allocated_Area  \
5967           13.64              No Sugar                   0.071   
8270           10.03             Low Sugar                   0.016   

      Product_Type  Product_MRP Store_Size Store_Location_City_Type  \
5967     Household       157.10     Medium                   Tier 2   
8270  Frozen Foods       126.66       High                   Tier 2   

             Store_Type  Store_Age  
5967  Supermarket Type2         16  
8270  Supermarket Type1         38  

⚡ Model Training Completed ---------
In [299]:
# Get best results
print("\nBest parameters:", catboost_rs.best_params_)
print("Best RMSE:", -catboost_rs.best_score_)  # scoring already is (negated) RMSE - no extra square root needed
Best parameters: {'model__depth': 7, 'model__iterations': 513, 'model__l2_leaf_reg': 3.1233911067827616, 'model__learning_rate': 0.06454749016213018, 'model__min_data_in_leaf': 21}
Best RMSE: ≈281.72

🔍 Why does a minus (-) appear?

  • scikit-learn search objects always maximize their scoring function, so an error metric like RMSE is passed in as its negative ('neg_root_mean_squared_error'). Negate best_score_ to recover the actual error:

actual_rmse = -search.best_score_
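The sign convention can be checked directly with scikit-learn's scorer objects; a tiny stand-in model (DummyRegressor is used here purely for illustration):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import get_scorer

X = np.zeros((4, 1))
y = np.array([1.0, 3.0, 1.0, 3.0])
model = DummyRegressor(strategy="mean").fit(X, y)  # always predicts 2.0

scorer = get_scorer("neg_root_mean_squared_error")
score = scorer(model, X, y)   # every error is 1.0, so RMSE = 1.0
print(score)                  # reported as -1.0: higher is "better"
```

Because the scorer returns -RMSE, a search that maximizes the score is equivalently minimizing the error.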

In [300]:
# For final predictions, use:
catboost_rs_tuned = catboost_rs.best_estimator_

Difference between best_estimator_ and original search instance

Both serve different purposes:

  1. random_search.best_estimator_:

    • Returns the fitted model with best parameters
    • Ready to use for predictions
    • This is what we want for final model
  2. random_search:

    • The RandomizedSearchCV object itself
    • Contains search history, CV results
    • Good for analysis, not for final predictions
# For final predictions, use:
best_model = random_search.best_estimator_
predictions = best_model.predict(X_test)

TIPS: ✅ Always use best_estimator_ for your final model!

Using best_estimator_ is preferred because:

  • It returns the actual pipeline/model with best params
  • You can then inspect, save, or reuse it like a normal model
In [301]:
quick_model_eval(catboost_rs_tuned, X_train, y_train, X_test, y_test, model_name=m3_name)
🚀 QUICK TEST: CatBoost_Tuned
----------------------------------------
Feature names: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

Train Metrics:--- 
RMSE: 251.03
R2: 0.944
MAE: 92.85

Test Metrics:---
RMSE: 277.09
R2: 0.933
MAE: 105.17

🧠 Insight:

Tuning didn’t move the needle much, which is often the case when the base model is already near-optimal.

  • Minor fluctuations in MAE/RMSE are likely within noise bounds.

Key Highlights 🌟

  1. Consistent Excellence: R² of 0.933 maintains the 93.3% explained variance
  2. Stable Accuracy: MAE of 105.17 keeps the average prediction error under 4% of actual sales (test MAPE: 3.87%)
  3. Improved Training Stability: Train-test gap slightly reduced (RMSE gap: 26.06 vs previous 32.9)
  4. Robust Performance: Metrics remain consistently excellent across all measures

Shows slightly better stability!

| Metric | Value  | Interpretation |
|--------|--------|----------------|
| RMSE   | 277.09 | ✅ Outstanding - ~8% of target mean, ~26% of std |
| MAE    | 105.17 | ✅ Still excellent - low average error (~3% of mean) |
| R²     | 0.933  | ✅ Excellent - explains 93.3% of target variance (same as before) |
In [302]:
evaluate_model_comprehensive(
    model=catboost_rs_tuned,
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    model_name=m3_name,
    model_type="Tree",
    tracker=tracker
)
📊 CatBoost_Tuned PERFORMANCE
==================================================
Training Metrics:
  RMSE: 251.0255
  MAE: 92.8548
  R2: 0.9444
  Adj_R2: 0.9443
  MAPE: 3.5841
  Samples: 7010.0000

Testing Metrics:
  RMSE: 277.0911
  MAE: 105.1661
  R2: 0.9327
  Adj_R2: 0.9323
  MAPE: 3.8729
  Samples: 1753.0000
Out[302]:
({'Dataset': 'Train',
  'RMSE': 251.0255178978363,
  'MAE': 92.85475093875208,
  'R2': 0.9444342670493849,
  'Adj_R2': 0.9443469245140239,
  'MAPE': 3.584149036976944,
  'Samples': 7010},
 {'Dataset': 'Test',
  'RMSE': 277.09109978117425,
  'MAE': 105.16610850102744,
  'R2': 0.9327097877434043,
  'Adj_R2': 0.9322846341909502,
  'MAPE': 3.8728926827812695,
  'Samples': 1753})

M4: Linear Reg Model (Tuned)¶

Key Parameters to Tune:

  1. Regularization Type:

    1. Lasso (L1)
    2. Ridge (L2)
    3. ElasticNet (L1 + L2)
  2. Regularization Strength:

    1. alpha parameter
    2. Controls overfitting

Ridge is better for our case because:

  • Better for handling correlated features
  • More stable predictions
  • Simpler than ElasticNet
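What the alpha knob actually does can be sketched on synthetic data (the coefficients below are made up for illustration): larger alpha shrinks the coefficients toward zero, trading a little bias for lower variance and more stable predictions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([5.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=200)

# Stronger regularization -> smaller coefficient magnitudes
for alpha in [0.01, 10.0, 1000.0]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))
```

This is why the grid below spans several orders of magnitude: the interesting behavior happens on a log scale of alpha.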
In [303]:
m4_name = 'Linear Model Tuned'
In [304]:
# Update pipeline with Ridge
ridge_model_pipeline = Pipeline([
    ('preprocessing', linear_preprocessing_pipeline),
    ('model', Ridge())
])
In [305]:
# Grid of alpha values
param_grid = {
    'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
    # 'model__alpha': np.logspace(-3, 3, 20)
}
In [306]:
# GridSearchCV setup
grid_search_ridge = GridSearchCV(
    estimator=ridge_model_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)
In [307]:
fit_model(grid_search_ridge)
⚡ Model Training Started ---------

Fitting 5 folds for each of 7 candidates, totalling 35 fits

🪵 After Data Cleaning ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

🪵 After Category Consolidation ---- (D)
Shape: (7010, 9)
Columns: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

🪵 After Preprocessing (ie Column Transformation) ---- (D)
Data type: numpy array

⚡ Model Training Completed ---------
In [308]:
# Step 5: Evaluate
print("Best Parameters:", grid_search_ridge.best_params_)
print("Best Score (neg RMSE):", grid_search_ridge.best_score_)
Best Parameters: {'model__alpha': 10.0}
Best Score (neg RMSE): -450.2733562560611
In [309]:
# For final predictions, use:
linear_model_tuned = grid_search_ridge.best_estimator_
In [310]:
quick_model_eval(linear_model_tuned, X_train, y_train, X_test, y_test, model_name=m4_name)
🚀 QUICK TEST: Linear Model Tuned
----------------------------------------
Feature names: ['onehot__Product_Sugar_Content_No Sugar'
 'onehot__Product_Sugar_Content_Regular' 'onehot__Product_Type_Canned'
 'onehot__Product_Type_Dairy' 'onehot__Product_Type_Frozen Foods'
 'onehot__Product_Type_Fruits and Vegetables'
 'onehot__Product_Type_Health and Hygiene'
 'onehot__Product_Type_Household' 'onehot__Product_Type_Meat'
 'onehot__Product_Type_Others' 'onehot__Product_Type_Snack Foods'
 'onehot__Product_Type_Soft Drinks' 'onehot__Store_Type_Food Mart'
 'onehot__Store_Type_Supermarket Type1'
 'onehot__Store_Type_Supermarket Type2'
 'power_transform__Product_Allocated_Area' 'ordinal__Store_Size'
 'ordinal__Store_Location_City_Type' 'standard_scale__Product_Weight'
 'standard_scale__Product_MRP' 'standard_scale__Store_Age']

Train Metrics:--- 
RMSE: 449.14
R2: 0.822
MAE: 259.54

Test Metrics:---
RMSE: 435.14
R2: 0.834
MAE: 255.47

Key Highlights 🌟

  1. Strong Baseline: performs very well for a linear model
  2. No Overfitting: train and test metrics are nearly identical
  3. Marginal Improvements: slight improvements in MAE and RMSE over the base model

Consider using CatBoost for highest accuracy, or this linear model when simplicity/speed is prioritized over maximum precision.

In [311]:
evaluate_model_comprehensive(
    model=linear_model_tuned,
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    model_name=m4_name,
    model_type="Regression",
    tracker=tracker
)
📊 Linear Model Tuned PERFORMANCE
==================================================
Training Metrics:
  RMSE: 449.1358
  MAE: 259.5390
  R2: 0.8221
  Adj_R2: 0.8218
  MAPE: 9.9623
  Samples: 7010.0000

Testing Metrics:
  RMSE: 435.1363
  MAE: 255.4742
  R2: 0.8341
  Adj_R2: 0.8330
  MAPE: 9.5640
  Samples: 1753.0000
Out[311]:
({'Dataset': 'Train',
  'RMSE': 449.1358366713886,
  'MAE': 259.53897853870836,
  'R2': 0.8221201634561986,
  'Adj_R2': 0.8218405581115313,
  'MAPE': 9.962270418096642,
  'Samples': 7010},
 {'Dataset': 'Test',
  'RMSE': 435.13632276038066,
  'MAE': 255.47423065159316,
  'R2': 0.8340575797101025,
  'Adj_R2': 0.8330091209948878,
  'MAPE': 9.564044966201452,
  'Samples': 1753})

Model Performance Comparison¶

In [312]:
# Get Summary
tracker.get_results_summary_vertically()
Out[312]:
Model_Name Linear Regression CatBoost Model CatBoost_Tuned Linear Model Tuned
Train_R2 0.822103 0.947367 0.944434 0.822120
Test_R2 0.833990 0.932651 0.932710 0.834058
Train_Adj_R2 0.821824 0.947285 0.944347 0.821841
Test_Adj_R2 0.832941 0.932226 0.932285 0.833009
Train_RMSE 449.156963 244.310642 251.025518 449.135837
Test_RMSE 435.225193 277.211800 277.091100 435.136323
Train_MAE 261.036728 89.624486 92.854751 259.538979
Test_MAE 257.000531 103.765306 105.166109 255.474231
Train_MAPE 9.997835 3.457860 3.584149 9.962270
Test_MAPE 9.598348 3.779483 3.872893 9.564045
Overfitting_Score -0.011886 0.014716 0.011724 -0.011937

⚡ Insights

  1. CatBoost Clearly Outperforms Linear Models (RMSE: ~435 → ~277, R2: ~0.83 → ~0.933)
  2. CatBoost Tuning Had Marginal Impact
  3. Linear Models Generalize Well But Lack Power
In [313]:
# Plot Comparison
tracker.plot_model_comparison()
In [314]:
# Pick Best Model
best_model_name, best_score = tracker.get_best_model(metric='Test_R2', higher_better=True)
🏆 BEST MODEL: CatBoost_Tuned
📊 Test_R2: 0.9327

🚀 Thus CatBoost model is Selected

In [315]:
best_model = catboost_rs_tuned

Model Serialization¶

Making the model ready for production

  • Let's make the debugging step inactive in the model!
In [318]:
type(best_model)
Out[318]:
sklearn.pipeline.Pipeline
In [316]:
best_model.named_steps
Out[316]:
{'preprocessing': Pipeline(steps=[('data_prep', CustomDataPrepper()),
                 ('debug1',
                  DebugTransformer(message='After Data Cleaning',
                                   show_head=True))]),
 'model': <catboost.core.CatBoostRegressor at 0x16af3e690>}
In [319]:
preprocessing_pipeline = best_model.named_steps['preprocessing']
In [320]:
debug_transformer = preprocessing_pipeline.named_steps['debug1']
In [321]:
type(debug_transformer)
Out[321]:
preprocessing.debug_transformer.DebugTransformer
In [322]:
# * For Production debugging is not required
debug_transformer.turn_off()
In [324]:
print('Is Debug Step Active?', best_model.named_steps['preprocessing']['debug1'].active)
Is Debug Step Active? False
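For context, a minimal sketch of what such a toggleable pass-through transformer might look like (the real class lives in preprocessing/debug_transformer.py; the shape below is an assumption, not the actual source):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class DebugTransformerSketch(BaseEstimator, TransformerMixin):
    """Pass-through step that can print diagnostics and be switched off."""

    def __init__(self, message="", active=True, show_head=False):
        self.message = message
        self.active = active
        self.show_head = show_head

    def fit(self, X, y=None):
        return self          # nothing to learn

    def transform(self, X):
        if self.active:      # side effect only; data flows through unchanged
            print(f"🪵 {self.message} ----")
        return X

    def turn_off(self):
        self.active = False

step = DebugTransformerSketch(message="After Data Cleaning")
step.turn_off()
print(step.transform([1, 2]), step.active)
```

Because `transform` is a pure pass-through, flipping `active` changes only the logging, never the predictions, which is why it's safe to switch off for production.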
In [325]:
class ModelSerializer:
    def __init__(self, base_path='models'):
        self.base_path = Path(base_path)
        self.base_path.mkdir(exist_ok=True)

    def save_model(self, model, model_name, include_timestamp=False):
        """Save model with optional timestamp"""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") if include_timestamp else ""
        filename = f"{model_name}_{timestamp}.joblib" if include_timestamp else f"{model_name}.joblib"
        path = self.base_path / filename

        joblib.dump(model, path)
        print(f"✅ Model saved: {path}")
        return path

    def load_model(self, model_path):
        """Load model from path"""
        path = Path(model_path)
        if not path.exists():
            raise FileNotFoundError(f"Model not found: {path}")

        model = joblib.load(path)
        print(f"✅ Model loaded: {path}")
        return model
In [326]:
serializer = ModelSerializer()
In [327]:
# Get pipeline steps without debug
# For Debugging Purpose only (Ignore)
clean_steps = [(name, step) for name, step in best_model.named_steps.items()
               if not isinstance(step, DebugTransformer)]
In [336]:
# Now save
serializer.save_model(best_model, "best_model", include_timestamp=False)
✅ Model saved: models/best_model.joblib
Out[336]:
PosixPath('models/best_model.joblib')
In [337]:
def test_saved_model(model_path):
    """
    Test saved model by loading and verifying predictions

    Args:
        model_path: Path to saved model
        X_test: Test features
        y_test: Test targets
    """
    print("\n🔄 Testing Saved Model")
    print("-" * 40)

    # Load model
    loaded_model = serializer.load_model(model_path)

    # Check cat names
    model = loaded_model.named_steps['model']
    cat_features = model.get_params()['cat_features']
    print("Cat Features: ", cat_features)

    # Verify predictions
    print("\n🔍 Performance Verification:")
    metrics = quick_model_eval(
        pipeline=loaded_model,
        X_train=X_train,
        y_train=y_train,
        X_test=X_test,
        y_test=y_test,
        model_name="Loaded Model"
    )

    return loaded_model, metrics
In [338]:
loaded_model, verification_metrics = test_saved_model("models/best_model.joblib")
🔄 Testing Saved Model
----------------------------------------
✅ Model loaded: models/best_model.joblib
Cat Features:  ['Product_Sugar_Content', 'Product_Type', 'Store_Size', 'Store_Location_City_Type', 'Store_Type']

🔍 Performance Verification:

🚀 QUICK TEST: Loaded Model
----------------------------------------
Feature names: ['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Store_Age']

Train Metrics:--- 
RMSE: 251.03
R2: 0.944
MAE: 92.85

Test Metrics:---
RMSE: 277.09
R2: 0.933
MAE: 105.17
In [339]:
loaded_model
Out[339]:
Pipeline(steps=[('preprocessing',
                 Pipeline(steps=[('data_prep', CustomDataPrepper()),
                                 ('debug1',
                                  DebugTransformer(active=False,
                                                   message='After Data '
                                                           'Cleaning',
                                                   show_head=True))])),
                ('model',
                 <catboost.core.CatBoostRegressor object at 0x16b399bd0>)])
In [365]:
# Custom Data Checker

def test_inference(model):
    # Test data
    # NOTE:
    # - Store_Age is not supplied (Store_Establishment_Year is given instead),
    #   and the columns are not in the same order as the training data
    # - CustomDataPrepper inside the pipeline handles this, unifying the column order
    #
    query_data = {
        'Product_Weight': 100,
        'Product_MRP': 50,
        'Product_Allocated_Area': 200,
        #'Store_Age': 5,
        'Store_Establishment_Year': 2020,
        'Product_Sugar_Content': 'Regular',
        'Product_Type': 'Dairy',
        'Store_Size': 'Medium',
        'Store_Location_City_Type': 'Tier 1',
        'Store_Type': 'Supermarket Type1'
    }
    dft = pd.DataFrame([query_data])
    return model.predict(dft)
In [366]:
test_inference(loaded_model)
Out[366]:
array([4192.07139122])

⚡ Woohoo.. It worked !!

Validated ✅

NOTE:

Since we trained our CatBoost model directly on the raw sales values (Product_Store_Sales_Total), we don't need any inverse transformations like np.exp().

Our model:

  • ✅ Predicts sales values directly
  • ✅ Output is in original scale ($)
  • ✅ No log/scaling transformations to reverse
  • ✅ Can use predictions as-is

Unlike some cases where you might:

  • Log-transform target (need exp)
  • Scale target (need inverse_transform)
  • Use different units (need conversion)

So we can use the model's predictions directly in our API responses.
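For contrast, this is what the inverse step would look like had the target been log-transformed before training (a hypothetical, not something this pipeline needs):

```python
import numpy as np

y = np.array([1200.0, 4300.0, 250.0])   # sales in original units
y_log = np.log1p(y)                     # what the model would be trained on
y_restored = np.expm1(y_log)            # inverse applied to raw predictions
print(np.allclose(y_restored, y))       # round-trip recovers original scale
```

Forgetting the `expm1` step in such a setup would silently return log-scale "sales", so training on raw values as done here keeps the API simpler.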

🧠 REMEMBER (Serialize | Custom Class)¶

When you serialize an object (like an ML model) with joblib or pickle, it records the fully qualified name of any custom classes used, i.e. <module_path>.<class_name>.

If this class is not found in the same module path during deserialization, you get errors like:

AttributeError: Can't get attribute 'MyCustomClass' on <module '__main__'>
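The "stored by reference" behavior is visible in any pickle payload; a sketch using a stdlib class (the same mechanism applies to CustomDataPrepper and DebugTransformer):

```python
import pickle
from decimal import Decimal

payload = pickle.dumps(Decimal("1.5"))

# The payload embeds the module path and class name, not the class code:
print(b"decimal" in payload, b"Decimal" in payload)

# At load time, pickle must be able to import decimal.Decimal again -
# which is exactly why custom classes must live at a stable module path
```

This is why the preprocessing classes were extracted into an importable `preprocessing` package instead of living in the notebook's `__main__`.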
In [331]:
def print_class_location(cls):
    # From object we can get the class via __class__
    print(f"{cls.__module__}.{cls.__qualname__}")
In [333]:
print_class_location(DebugTransformer)
preprocessing.debug_transformer.DebugTransformer
In [335]:
print_class_location(CustomDataPrepper)
preprocessing.custom_data_prepper.CustomDataPrepper


Phase 2 MLOPS - (Deployment)¶

IMP Points¶

🔐 NOTE on Access

  • Both Spaces should be Public or both should be Private, otherwise you’ll get CORS or auth errors.

  • If Private, make sure you're logged in when testing the frontend (or use Hugging Face tokens).

BackEnd (Infra)¶

Setup

  1. Go to Hugging Face Spaces
  2. Create New Space
  3. Add Space Name (superkart-backend)
  4. SDK: choose Docker
  5. Create New Space

https://huggingface.co/spaces/nipunshah/superkart-backend

Access Token (We'll need this later):

  1. Go to Settings → Access Tokens
  2. Create new token with "write" role
  3. Save token securely (we'll use it for file uploads)

NOTE: This is needed only when using the huggingface_hub Python package to upload files and folders from local to Spaces. If you use GitHub or manual upload, it's not needed.

TIPS (Cells | Notebooks)

  • You can use %%writefile in a Jupyter notebook cell to create and write to files directly from the notebook. This is perfect for reproducibility and submission!

High Level Backend Folder Structure Idea

backend_files/
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── routes.py
│   ├── model_loader.py
│   ├── predict.py
├── models/
│   └── best_model.joblib     # (you'll copy this manually)
├── preprocessing/
│   ├── custom_data_prepper.py
│   ├── debug_transformer.py
│   ├── __init__.py
├── requirements.txt
├── Dockerfile
└── README.md                 # (optional)

Create Folder (Only Once)

In [345]:
# Create main backend directory
os.makedirs("backend", exist_ok=True)
# Create app and models subdirectories
os.makedirs("backend/app", exist_ok=True)
os.makedirs("backend/models", exist_ok=True)
# Helpers
# Create preprocessing folder
os.makedirs("backend/preprocessing", exist_ok=True)
In [346]:
# Copy all files from preprocessing to backend/preprocessing
!cp preprocessing/* backend/preprocessing/
cp: preprocessing/__pycache__ is a directory (not copied).
In [347]:
# Copy model to backend
!cp models/best_model.joblib backend/models/

App¶

NOTE:

  • Relative import is necessary to avoid any path issues
1. Model Loader¶
In [ ]:
# %%writefile backend/app/model_loader.py
# import joblib
# import os
# from typing import Optional

# # Singleton pattern for model
# _model = None

# def load_model(path: str = "models/best_model.joblib") -> Optional[object]:
#     """
#     Load model using singleton pattern to avoid multiple loads

#     Args:
#         path: Path to saved model file

#     Returns:
#         Loaded model object

#     Raises:
#         FileNotFoundError: If model file doesn't exist
#     """
#     global _model

#     if _model is None:
#         base_dir = os.path.dirname(__file__)
#         model_path_rel = os.path.join(base_dir, '..', path) # path/to/backend
#         # ! note this is optional step model_path_rel also works
#         # this just converts to absolute path (thats it)
#         model_path = os.path.abspath(model_path_rel)
#         if not os.path.exists(model_path):
#             raise FileNotFoundError(f"Model not found at {model_path}")
#         try:
#             _model = joblib.load(model_path)
#             print(f"✅ Model loaded successfully from {model_path}")
#         except Exception as e:
#             raise Exception(f"Error loading model: {str(e)}")

#     return _model
In [350]:
%%writefile backend/app/model_loader.py
#import __main__
import joblib
import os
from typing import Optional
from preprocessing import CustomDataPrepper, DebugTransformer
#from ..preprocessing import CustomDataPrepper, DebugTransformer

# Using relative imports is better practice when dealing with modules within the same package

# ! Below is hack used earlier when Preprocessing Helper class were not extracted to separate package !
# ---
# Register it in __main__ so joblib/pickle can resolve it
#__main__.CustomDataPrepper = CustomDataPrepper
#__main__.DebugTransformer = DebugTransformer
# ---

# Class based approach to load model (setup)

class ModelLoader:
    """Singleton class for loading and managing ML model"""
    _instance = None
    _model = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(ModelLoader, cls).__new__(cls)
        return cls._instance

    def get_model(self, path: str = "models/best_model.joblib") -> Optional[object]:
        """Get model, loading it if not already loaded"""
        if self._model is None:
            print('Model not loaded yet, loading ...')
            base_dir = os.path.dirname(__file__)
            model_path_rel = os.path.join(base_dir, '..', path) # path/to/backend
            # ! note this is optional step model_path_rel also works
            # this just converts to absolute path (thats it)
            model_path = os.path.abspath(model_path_rel)
            if not os.path.exists(model_path):
                raise FileNotFoundError(f"Model not found at {model_path}")
            try:
                self._model = joblib.load(model_path)
                print(f"✅ Model loaded successfully from {model_path}")
            except Exception as e:
                raise Exception(f"Error loading model: {str(e)}")

        return self._model
Overwriting backend/app/model_loader.py
2. Predict Business Logic¶
In [351]:
%%writefile backend/app/predict.py
import pandas as pd
from typing import Dict, List, Union
from .model_loader import ModelLoader

# Pure Prediction Logic

class SalesPredictor:
    """Handle sales prediction logic and data processing"""

    def __init__(self):
        self.model = ModelLoader().get_model()

    def predict_single(self, data: Dict) -> Dict:
        """
        Predict sales for single store-product

        Args:
            data: Dictionary with feature values

        Returns:
            Dictionary with prediction and metadata
        """
        try:
            # Convert single input to DataFrame
            df = pd.DataFrame([data])

            # Make prediction
            prediction = self.model.predict(df)[0]

            return {
                "status": "success",
                "predicted_sales": float(prediction),
                "message": "Prediction successful"
            }

        except Exception as e:
            return {
                "status": "error",
                "message": f"Prediction failed: {str(e)}"
            }

    def predict_batch(self, df: pd.DataFrame) -> Dict:
        """
        Predict sales for multiple store-products

        Args:
            df: DataFrame with feature values

        Returns:
            Dictionary with predictions and metadata
        """
        try:
            # Make predictions
            predictions = self.model.predict(df)

            return {
                "status": "success",
                "predictions": predictions.tolist(),
                "message": f"Successfully predicted {len(predictions)} samples"
            }

        except Exception as e:
            return {
                "status": "error",
                "message": f"Batch prediction failed: {str(e)}"
            }
Writing backend/app/predict.py
3. Routes configuration/mapping¶
In [372]:
%%writefile backend/app/routes.py
from flask import Blueprint, request, jsonify
from .predict import SalesPredictor
import pandas as pd

# Create Blueprint for prediction routes
router = Blueprint('predictions', __name__)

# Required features for prediction
REQUIRED_FEATURES = [
    'Product_Weight', 'Product_MRP', 'Product_Allocated_Area',
    'Store_Establishment_Year', 'Product_Sugar_Content', 'Product_Type',
    'Store_Size', 'Store_Location_City_Type', 'Store_Type'
]

# Initialize predictor
predictor = SalesPredictor()

@router.get('/')
def home():
    return "Welcome to SuperKart Sales Prediction API!"

@router.post('/predict')
def predict_sales():
    """Single prediction endpoint"""
    try:
        # Get JSON data
        data = request.get_json()

        # Validate required features
        missing_features = [feat for feat in REQUIRED_FEATURES if feat not in data]
        if missing_features:
            return jsonify({
                "status": "error",
                "message": f"Missing required features: {missing_features}"
            }), 400

        # Create DataFrame with required features
        input_data = {feature: data[feature] for feature in REQUIRED_FEATURES}

        # Get prediction
        result = predictor.predict_single(input_data)

        return jsonify(result), 200 if result["status"] == "success" else 400

    except Exception as e:
        return jsonify({
            "status": "error",
            "message": f"Prediction failed: {str(e)}"
        }), 500

@router.post('/predict_batch')
def predict_sales_batch():
    """Batch prediction endpoint"""
    try:
        # Check if file was uploaded
        if 'file' not in request.files:
            return jsonify({
                "status": "error",
                "message": "No file uploaded"
            }), 400

        file = request.files['file']

        # Read CSV file
        df = pd.read_csv(file)

        # Validate required features
        missing_features = [feat for feat in REQUIRED_FEATURES if feat not in df.columns]
        if missing_features:
            return jsonify({
                "status": "error",
                "message": f"Missing required features in CSV: {missing_features}"
            }), 400

        # Select only required features
        input_df = df[REQUIRED_FEATURES]

        # Get predictions
        result = predictor.predict_batch(input_df)

        return jsonify(result), 200 if result["status"] == "success" else 400

    except Exception as e:
        return jsonify({
            "status": "error",
            "message": f"Batch prediction failed: {str(e)}"
        }), 500
Overwriting backend/app/routes.py
4. main.py (Entry Point)¶
In [353]:
%%writefile backend/app/main.py
from flask import Flask
from .routes import router  # Import Blueprint directly

def create_app():
    """Create and configure Flask application"""

    # NOTE: Flask(__name__) is the recommended way because
    # 1. Package Resolution
    # 2. Custom string might break package resolution

    app = Flask(__name__)  # Instead of Flask("SuperKart Sales Predictor")

    # Configure app settings
    app.config.update(
        JSON_SORT_KEYS=False,  # Preserve JSON response order
        MAX_CONTENT_LENGTH=16 * 1024 * 1024  # 16MB max file size for batch predictions
    )

    # Register routes blueprint
    app.register_blueprint(router)

    return app

# Create app instance
app = create_app()

if __name__ == '__main__':
    # For local development
    # app.run(debug=True)

    # For HuggingFace deployment
    app.run(host="0.0.0.0", port=7860, debug=True)
Writing backend/app/main.py

This:

  1. Uses __file__ to get the current file's location
  2. Navigates up one level (..)
  3. Joins with the model path
  4. Converts to absolute path, resolving any . or ..

Result:

  • Always gets absolute path
  • Resolves any symbolic links
  • Works across different OS
  • Independent of where code is run from
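The resolution steps above can be sketched in isolation (the base directory below is a pretend stand-in for `os.path.dirname(__file__)`):

```python
import os

base_dir = "/srv/backend/app"   # stand-in for os.path.dirname(__file__)
rel = os.path.join(base_dir, "..", "models/best_model.joblib")

# os.path.abspath calls normpath internally; normpath is what collapses ".."
resolved = os.path.normpath(rel)
print(resolved)   # the models/ dir that sits next to app/
```

The resolved path no longer contains `..`, so it works the same regardless of the process's current working directory.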
5. Package Protocol¶
In [354]:
%%writefile backend/app/__init__.py
# Future exports when needed:
# from .main import app
# from .routes import router
# from .predict import SalesPredictor
# from .model_loader import ModelLoader

# __all__ = ['app', 'router', 'SalesPredictor', 'ModelLoader']
Writing backend/app/__init__.py

Requirements (ie Dependencies Listing)

In [417]:
# let's check our current environment
!pip freeze > current_requirements.txt
# !cat current_requirements.txt
In [355]:
!pip list | grep -i "pandas\|numpy\|scikit\|catboost\|joblib\|flask\|gunicorn\|requests"
catboost                      1.2.8
Flask                         2.2.2
gunicorn                      23.0.0
joblib                        1.2.0
numpy                         1.24.3
numpydoc                      1.5.0
pandas                        2.3.0
requests                      2.31.0
requests-file                 1.5.1
requests-toolbelt             1.0.0
scikit-image                  0.20.0
scikit-learn                  1.2.2
In [356]:
import sklearn as sk
import catboost
import flask
import gunicorn
import requests
# Let's see the current versions in use
print("Current versions in use:")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"scikit-learn: {sk.__version__}")
print(f"catboost: {catboost.__version__}")
print(f"joblib: {joblib.__version__}")
print(f"flask: {flask.__version__}")
print(f"gunicorn: {gunicorn.__version__}")
print(f"requests: {requests.__version__}")
Current versions in use:
pandas: 2.3.0
numpy: 1.24.3
scikit-learn: 1.2.2
catboost: 1.2.8
joblib: 1.2.0
flask: 2.2.2
gunicorn: 23.0.0
requests: 2.31.0
In [357]:
%%writefile backend/requirements.txt
# ML & Data Processing
pandas==2.3.0
numpy==1.24.3
scikit-learn==1.2.2
catboost==1.2.8
joblib==1.2.0

# API & Server
# flask==2.2.2
flask>=2.0,<2.3
werkzeug<3.0.0  # Flask 2.2.x is not compatible with Werkzeug 3.x
gunicorn==23.0.0

# Utilities
requests==2.31.0
Writing backend/requirements.txt
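Before freezing pins like these, it helps to confirm they match the active environment; a stdlib-only sketch (package names taken from the list above):

```python
from importlib import metadata

def check_pin(pkg: str, expected: str) -> str:
    """Compare an installed package's version against a pinned version."""
    try:
        installed = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return "NOT INSTALLED"
    return "OK" if installed == expected else f"MISMATCH (installed {installed})"

# Spot-check a few pins from requirements.txt
for pkg, expected in [("pandas", "2.3.0"), ("numpy", "1.24.3"), ("scikit-learn", "1.2.2")]:
    print(f"{pkg}=={expected}: {check_pin(pkg, expected)}")
```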
In [360]:
%%writefile backend/Dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app/ app/
COPY models/ models/
COPY preprocessing/ preprocessing/

# Set environment variable for port
ENV PORT=7860

EXPOSE 7860

CMD ["gunicorn", "--workers", "1", "--timeout", "180", "--bind", "0.0.0.0:7860", "app.main:app"]
Overwriting backend/Dockerfile

TODO: make the PORT number configurable rather than hard-coding 7860 in several places (the alternative Dockerfile below reads it from an environment variable).

Gist

Section              What's Good
python:3.10-slim     ✅ Lightweight base image
ENV PORT=7860        ✅ Declares the port in one place
EXPOSE 7860          ✅ Documents the listening port (the alternative below uses EXPOSE ${PORT})
gunicorn usage       ✅ Better for a production-like demo
--workers 1          ✅ Hugging Face Spaces is low-concurrency; one worker is fine
--timeout 180        ✅ Helpful if the model loads slowly
"app.main:app"       ✅ Correct WSGI path (i.e., app = Flask(...) exposed in app/main.py)

🔍 Just One Check

Be sure a module-level app exists in app/main.py:

app = create_app()  # or app = Flask(__name__)

Gunicorn's app.main:app path looks up the name app at module level. If the instance only ever exists inside create_app(), point Gunicorn at the factory instead: app.main:create_app().

Alternative

FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app/ app/
COPY models/ models/

ENV PORT=8000

# Add memory limit for ML model
ENV MAX_WORKERS=1

EXPOSE ${PORT}

CMD ["sh", "-c", "gunicorn --timeout 180 -w ${MAX_WORKERS} -b 0.0.0.0:${PORT} app.main:app"]

Upload backend module to HF Space¶

In [373]:
# HF Config
access_token = os.environ["HF_TOKEN"]  # read from environment; never hard-code tokens in notebooks
repo_id = "nipunshah/superkart-backend"

# Login to Hugging Face
login(token=access_token)

# Initialize API
api = HfApi()

# Upload backend folder
api.upload_folder(
    folder_path="backend",    # Our local backend folder
    repo_id=repo_id,         # Your space name
    repo_type="space"        # Type is space
)
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/nipunshah/.cache/huggingface/token
Login successful
Out[373]:
'https://huggingface.co/spaces/nipunshah/superkart-backend/tree/main/'
  • Backend module uploaded successfully
  • Space Url: https://huggingface.co/spaces/nipunshah/superkart-backend/

NOTE:

Successful deployment logs look something like this in the Container tab:

===== Application Startup at 2025-07-04 17:55:31 =====

[2025-07-04 17:55:43 +0000] [1] [INFO] Starting gunicorn 23.0.0
[2025-07-04 17:55:43 +0000] [1] [INFO] Listening at: http://0.0.0.0:7860 (1)
[2025-07-04 17:55:43 +0000] [1] [INFO] Using worker: sync
[2025-07-04 17:55:43 +0000] [7] [INFO] Booting worker with pid: 7
In [374]:
def test_api(data: dict, endpoint: str = 'predict',
             base_url: str = "https://nipunshah-superkart-backend.hf.space") -> None:
    """
    Test SuperKart API endpoints

    Args:
        data: Dictionary of features for prediction
        endpoint: 'predict' or 'predict_batch'
        base_url: HF space URL (defaults to our deployed space)
    """

    url = f"{base_url}/{endpoint}"

    try:
        # Make API call
        response = requests.post(url, json=data)

        print(f"Status Code: {response.status_code}")
        print("\nResponse:")
        print(json.dumps(response.json(), indent=2))

    except Exception as e:
        print(f"Error: {str(e)}")
In [ ]:
test_data = {
    'Product_Weight': 100,
    'Product_MRP': 50,
    'Product_Allocated_Area': 200,
    'Store_Establishment_Year': 2020,
    'Product_Sugar_Content': 'Regular',
    'Product_Type': 'Dairy',
    'Store_Size': 'Medium',
    'Store_Location_City_Type': 'Tier 1',
    'Store_Type': 'Supermarket Type1'
}
test_api(test_data)
Status Code: 200

Response:
{
  "status": "success",
  "predicted_sales": 4192.071391218717,
  "message": "Prediction successful"
}

✨ Hurray, Verified ✅

Frontend¶

Structure

frontend/
├── app.py             # Streamlit application
├── requirements.txt   # Dependencies
└── Dockerfile         # Container configuration

Let's first look at the unique values of each categorical column

In [377]:
def get_list_of_unique_values(df, column_name):
    """
    Get list of unique values for a given column in a DataFrame

    Args:
        df: DataFrame containing the column
        column_name: Name of the column to get unique values from

    Returns:
        List of unique values
    """
    return df[column_name].unique().tolist()
In [382]:
product_types = get_list_of_unique_values(df, 'Product_Type')
product_types
Out[382]:
['Frozen Foods',
 'Dairy',
 'Canned',
 'Baking Goods',
 'Health and Hygiene',
 'Snack Foods',
 'Meat',
 'Household',
 'Hard Drinks',
 'Fruits and Vegetables',
 'Breads',
 'Soft Drinks',
 'Breakfast',
 'Others',
 'Starchy Foods',
 'Seafood']
In [386]:
product_sugar_contents = get_list_of_unique_values(df, 'Product_Sugar_Content')
product_sugar_contents.remove('reg')  # 'reg' is an unnormalized variant of 'Regular'
product_sugar_contents
Out[386]:
['Low Sugar', 'Regular', 'No Sugar']
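For reference, the raw column also contains 'reg' as an unnormalized spelling of 'Regular' (which is why it is removed above); a minimal normalization sketch with toy values, assuming pandas:

```python
import pandas as pd

# Toy values standing in for the Product_Sugar_Content column
sugar = pd.Series(["Low Sugar", "reg", "Regular", "No Sugar", "reg"])

# Map the raw 'reg' variant onto the canonical label
normalized = sugar.replace({"reg": "Regular"})

print(sorted(normalized.unique().tolist()))
# ['Low Sugar', 'No Sugar', 'Regular']
```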
In [384]:
store_sizes = get_list_of_unique_values(df, 'Store_Size')
store_sizes
Out[384]:
['Medium', 'High', 'Small']
In [385]:
store_location_city_types = get_list_of_unique_values(df, 'Store_Location_City_Type')
store_location_city_types
Out[385]:
['Tier 2', 'Tier 1', 'Tier 3']
In [388]:
store_types = get_list_of_unique_values(df, 'Store_Type')
store_types
Out[388]:
['Supermarket Type2', 'Departmental Store', 'Supermarket Type1', 'Food Mart']

Create the frontend root directory

In [376]:
# Create frontend directory
os.makedirs("frontend", exist_ok=True)

app.py (Streamlit entry point)

In [428]:
%%writefile frontend/app.py
import streamlit as st
import requests
import pandas as pd
import datetime
import io

# Base URL for the backend API
BASE_URL = "https://nipunshah-superkart-backend.hf.space"

# Set page title
st.title("SuperKart Sales Prediction")

# Create tabs for single prediction and batch prediction
tab1, tab2 = st.tabs(["Single Prediction", "Batch Prediction"])

product_types = [
    "Frozen Foods",
    "Dairy",
    "Canned",
    "Baking Goods",
    "Health and Hygiene",
    "Snack Foods",
    "Meat",
    "Household",
    "Hard Drinks",
    "Fruits and Vegetables",
    "Breads",
    "Soft Drinks",
    "Breakfast",
    "Others",
    "Starchy Foods",
    "Seafood",
]
product_sugar_contents = ["Low Sugar", "Regular", "No Sugar"]
store_sizes = ["Small", "Medium", "High"]
store_location_city_types = ["Tier 1", "Tier 2", "Tier 3"]
store_types = [
    "Supermarket Type2",
    "Departmental Store",
    "Supermarket Type1",
    "Food Mart",
]

with tab1:
    st.subheader("Predict Sales for Single Product")

    # Numeric inputs
    product_weight = st.number_input("Product Weight", min_value=0.0, step=0.01)
    product_mrp = st.number_input("Product MRP", min_value=0.0, step=0.1)
    product_allocated_area = st.number_input(
        "Product Allocated Area (0 to 1)",
        min_value=0.0,
        max_value=1.0,
        value=0.5,
        step=0.01,
        format="%.2f",  # shows 2 decimal places
    )
    store_est_year = st.number_input(
        "Store Establishment Year",
        min_value=1900,
        max_value=datetime.datetime.now().year,
        step=1,
    )

    # --- Categorical (enumerated) inputs ---

    product_sugar_content = st.selectbox(
        "Product Sugar Content", product_sugar_contents
    )

    product_type = st.selectbox("Product Type", product_types)

    store_size = st.selectbox("Store Size", store_sizes)

    store_location_city_type = st.selectbox(
        "Store Location City Type", store_location_city_types
    )

    store_type = st.selectbox("Store Type", store_types)

    # Predict button
    if st.button("Predict Sales"):
        # Prepare data
        data = {
            "Product_Weight": product_weight,
            "Product_MRP": product_mrp,
            "Product_Allocated_Area": product_allocated_area,
            "Store_Establishment_Year": store_est_year,
            "Product_Sugar_Content": product_sugar_content,
            "Product_Type": product_type,
            "Store_Size": store_size,
            "Store_Location_City_Type": store_location_city_type,
            "Store_Type": store_type,
        }

        with st.spinner("Predicting..."):
            # Make prediction request
            try:
                response = requests.post(f"{BASE_URL}/predict", json=data)
                result = response.json()

                if result["status"] == "success":
                    st.success(f"Predicted Sales: ₹{result['predicted_sales']:.2f}")
                else:
                    st.error(f"Prediction failed: {result['message']}")
            except Exception as e:
                st.error(f"Error making prediction: {str(e)}")

with tab2:
    st.subheader("Predict Sales for Multiple Products")

    # File upload
    uploaded_file = st.file_uploader("Upload CSV file", type=["csv"])

    if uploaded_file is not None:
        try:
            # Preview the uploaded data (reading consumes the buffer; we
            # seek(0) again later before sending the file to the API)
            df = pd.read_csv(uploaded_file)
            st.write("Preview of uploaded data:")
            st.write(df.head())

            # Check required columns
            required_cols = [
                "Product_Weight",
                "Product_MRP",
                "Product_Allocated_Area",
                "Store_Establishment_Year",
                "Product_Sugar_Content",
                "Product_Type",
                "Store_Size",
                "Store_Location_City_Type",
                "Store_Type",
            ]

            missing_cols = [col for col in required_cols if col not in df.columns]

            if missing_cols:
                st.error(f"Missing required columns: {', '.join(missing_cols)}")
            else:
                # Make prediction button
                if st.button("Predict Sales for All Products"):
                    with st.spinner("Predicting..."):
                        # Make prediction request
                        try:
                            # Reset file pointer and create a new file-like object
                            uploaded_file.seek(0) # as we read it earlier for preview

                            # Create a named temporary file-like object
                            files = {
                                "file": ("batch_predict.csv", uploaded_file, "text/csv")
                            }

                            # files = {"file": uploaded_file}
                            response = requests.post(
                                f"{BASE_URL}/predict_batch", files=files
                            )
                            result = response.json()

                            if result["status"] == "success":
                                # Add predictions to dataframe
                                df["Predicted_Sales"] = result["predictions"]
                                st.success("Predictions completed successfully!")
                                st.write("Results:")
                                st.write(df)

                                # Download button for results
                                csv = df.to_csv(index=False)
                                st.download_button(
                                    "Download Results",
                                    csv,
                                    "predictions.csv",
                                    "text/csv",
                                    key="download-csv",
                                )
                            else:
                                st.error(f"Prediction failed: {result['message']}")
                        except Exception as e:
                            st.error(f"Error making prediction: {str(e)}")
        except Exception as e:
            st.error(f"Error reading file: {str(e)}")
Overwriting frontend/app.py

Requirements

In [391]:
%%writefile frontend/requirements.txt
streamlit
requests
pandas
Writing frontend/requirements.txt

Docker

In [426]:
%%writefile frontend/Dockerfile
FROM python:3.9-slim

WORKDIR /app

# Install system dependencies (some Python packages need a compiler to build)
RUN apt-get update && apt-get install -y build-essential && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

# Required: without a writable .streamlit directory, file upload won't work
RUN mkdir -p /app/.streamlit && chmod 777 /app/.streamlit

# Copy all files from current directory to container's /app directory
COPY . .

# Set up the .streamlit config (directory already created above)
RUN echo "\
[server]\n\
headless = true\n\
port = 7860\n\
enableCORS = false\n\
\n\
[browser]\n\
gatherUsageStats = false\n\
" > /app/.streamlit/config.toml

EXPOSE 7860

# Enable asyncio debug mode (required here: without it, file upload won't work)
ENV PYTHONASYNCIODEBUG=1

CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]
# NOTE: Disable XSRF protection for easier external access in order to make batch predictions
Overwriting frontend/Dockerfile

Upload to Hugging Face Hub

In [429]:
# # HF Config (already logged in above)
# access_token = os.environ["HF_TOKEN"]  # read from environment; never hard-code tokens
repo_id = "nipunshah/superkart-frontend"

# # Login to Hugging Face
# login(token=access_token)

# # Initialize API
# api = HfApi()

# Upload frontend folder
api.upload_folder(
    folder_path="frontend",    # Our local frontend folder
    repo_id=repo_id,         # Your space name
    repo_type="space"        # Type is space
)
Out[429]:
'https://huggingface.co/spaces/nipunshah/superkart-frontend/tree/main/'

Sample csv for batch prediction

In [405]:
os.makedirs("testset", exist_ok=True)
In [406]:
%%writefile testset/test_batch.csv
Product_Weight,Product_MRP,Product_Allocated_Area,Store_Establishment_Year,Product_Sugar_Content,Product_Type,Store_Size,Store_Location_City_Type,Store_Type
12.66,117.08,0.027,2009,Low Sugar,Frozen Foods,Medium,Tier 2,Supermarket Type2
16.54,171.43,0.144,1999,Regular,Dairy,Medium,Tier 1,Departmental Store
14.28,162.08,0.031,1987,Regular,Canned,High,Tier 2,Supermarket Type1
9.57,123.67,0.01,1998,No Sugar,Health and Hygiene,Small,Tier 3,Food Mart
13.92,116.89,0.099,2009,Low Sugar,Fruits and Vegetables,Medium,Tier 2,Supermarket Type2
Writing testset/test_batch.csv
In [414]:
def test_batch_inference():
    # Base URL of our backend
    base_url = "https://nipunshah-superkart-backend.hf.space"

    # Open the sample CSV and send it as multipart/form-data
    with open('testset/test_batch.csv', 'rb') as f:
        response = requests.post(
            f"{base_url}/predict_batch",
            files={'file': f}
        )

    # Display results
    if response.status_code == 200:
        result = response.json()
        print("Predictions:", result['predictions'])
    else:
        print("Error:", response.text)
In [415]:
test_batch_inference()
Predictions: [2876.456053269243, 4846.866048789477, 4139.998187436777, 2044.3859239158705, 3140.2883973846215]

✨ Hurray, Verified ✅
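To persist batch results, the predictions can be joined back onto the input rows and written out; a minimal sketch with toy rows (pandas assumed; predictions.csv is a hypothetical output name):

```python
import io
import pandas as pd

# Two rows mimicking the test CSV (subset of columns for brevity)
csv_text = """Product_Weight,Product_MRP
12.66,117.08
16.54,171.43
"""
predictions = [2876.46, 4846.87]  # /predict_batch returns one value per input row, in order

df = pd.read_csv(io.StringIO(csv_text))
df["Predicted_Sales"] = predictions   # align by row order
df.to_csv("predictions.csv", index=False)
print(df["Predicted_Sales"].tolist())  # [2876.46, 4846.87]
```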

Screenshots¶

In [15]:
from IPython.display import Image, display, HTML
In [14]:
Image(filename='../screenshots/projectspecifics/frontend_single1.png')
Out[14]:
In [19]:
Image(filename='../screenshots/projectspecifics/frontend_single2.png')
Out[19]:
In [20]:
Image(filename='../screenshots/projectspecifics/frontend_batch_1.png')
Out[20]:
In [21]:
Image(filename='../screenshots/projectspecifics/frontend_batch_2.png')
Out[21]:

Links to the Spaces:

  • https://huggingface.co/spaces/nipunshah/superkart-backend
  • https://huggingface.co/spaces/nipunshah/superkart-frontend

Actionable Insights & Recommendations¶

🔍 Insights:¶

🏪 Store & Location Dynamics¶

  • Mid-sized stores dominate: 69% of product-store interactions occur in medium-sized stores, highlighting SuperKart's core operational format.
  • Clear scale benefit: High-sized stores generate 2.2x more sales than small stores (₹3,924 vs ₹1,763), confirming economies of scale.
  • Small stores underperform: The drastic gap between small and medium stores suggests different business models, not just scaling differences.
  • Tier 2 saturation: 71% of SuperKart's stores are in Tier 2 cities, showing a strategy aimed at rising urban middle class.
  • Tier 1 stores outperform: Despite fewer locations, Tier 1 stores significantly outperform Tier 2 and Tier 3, highlighting quality over quantity.
  • Tier 3 ≈ Small stores: Tier 3 cities mimic small store dynamics, hinting at infrastructural or demand limitations.

🛒 Format & Product-Type Observations¶

  • Supermarket Type2 leads: Over 53% of operations are under this format, suggesting it's SuperKart’s operational backbone.
  • Departmental Stores excel: Outperform others across categories, especially for breads, breakfast, and staple foods, likely due to routine/impulse buying.
  • Food Marts lag: These formats underperform, possibly due to limited assortment or footfall.

🍞 Product Insights¶

  • Staple items dominate: Categories like Starchy Foods, Breads, Dairy, and Seafood consistently deliver higher average sales.
  • Breakfast sensitive to space: Performs poorly in small stores, suggesting impulse-driven or layout-dependent behavior.
  • Beverages underperform: Hard Drinks see the lowest traction, and Soft Drinks are mid-tier, hinting at unique beverage consumption trends.

💸 Pricing & Allocation¶

  • Premium pricing works: Sales increase nearly 5× from the low to the high MRP band, indicating that SuperKart's premium pricing is effective and accepted by the market.
  • Display area has little impact: Contrary to traditional retail logic, allocation space does not significantly impact sales at SuperKart - suggesting an already optimized layout.

📌 Recommendations¶

🏬 Store & Market Strategy¶

  • Double down on medium and high-sized stores: they strike the best balance between cost and revenue, with medium stores already accounting for 69% of operations.
  • Tier 1 expansion: Although fewer, these locations generate the highest revenue per unit. Consider selective premium expansion here.
  • Reevaluate small formats and Tier 3 operations, possibly rethink their role (e.g., micro-fulfillment or niche SKUs).
  • Explore Departmental Store format growth, consistently high performers in multiple product types.
  • Implement tier-specific premium product strategies
  • Standardize successful store operations
  • Develop data-driven inventory management.

📦 Product & Inventory Strategy¶

  • Prioritize staples like Starchy Foods, Breads, Dairy, and Seafood - high and consistent performers across store types.
  • Bundle or promote breakfast items in larger stores. Smaller formats are ineffective in capturing their value.
  • Audit underperforming beverages (esp. Hard Drinks) to identify marketing, licensing, or supply-side limitations.

🎯 Pricing & Merchandising¶

  • Maintain and promote premium SKUs: price-to-sales relationship confirms consumer readiness to pay for value.
  • Don’t overinvest in allocation area: flat sales trend across space ranges implies layout is already efficient.

🚀 Growth Strategy¶

  • Quality locations outperform quantity strategy
  • Premium products show proportional returns
  • Store format significantly impacts category performance

Recommendations:

  • Focus on strategic location selection over rapid expansion
  • Develop premium private label products
  • Create format-specific success blueprints